Introduction

MSc Machine Learning Assignment - Classification task. Private Kaggle Competition: "Are you sure Brighton's seagull is not a man-made object?" The aim of the assignment was to build a classifier able to distinguish between man-made and not man-made objects. Each data instance was represented by a 4608-dimensional feature vector: a concatenation of 4096-dimensional deep Convolutional Neural Network (CNN) features extracted from the fc7 activation layer of CaffeNet and 512-dimensional GIST features.

Three additional pieces of information were provided: a confidence label for each training instance, the test data class proportions, and additional training data containing missing values.

This Notebook contains the final workflow employed to produce the model used to make the final predictions. The original Notebook contained a lot of trial and error, such as fine-tuning the ranges of parameters fitted in the model. Some of these details from the original, rather messy Notebook have been excluded here. Thus, this Notebook only intends to show the code for the key processes leading up to the development of the final model. In addition, this Notebook contains a report/commentary documenting the theory behind each step of the workflow. The theory is drawn from the literature and referenced appropriately.

The report is also available as a PDF; contact me to request it.

1. Approach

1.1) Introduction to SVM

The approach of choice here was the Support Vector Machine (SVM). SVMs were pioneered in the late seventies [1]. SVMs are supervised learning models, which are extensively used for classification [2] and regression tasks [3]. In this context, SVM was employed for a binary classification task.

In layman’s terms, the basic premise of SVM for classification tasks is to find the optimal separating hyperplane (also called the decision boundary) between classes by maximizing the margin between the decision boundary and the data points closest to it. The points closest to the decision boundary are termed support vectors. The margin is maximized to improve the generalisation of the decision boundary; many separating boundaries may exist, but the one that maximizes the margin increases the likelihood that future outliers will be correctly classified [4]. This intuition seems relatively simple, but is complicated by ‘soft’ and ‘hard’ margins. A hard margin is only applicable when the data set is linearly separable; it does not tolerate any misclassifications, so no separating hyperplane can be found if the data are not linearly separable. A soft margin is applicable when the data set is not linearly separable. Essentially, a soft margin anticipates that some points cannot be separated correctly, and tolerates misclassifications through a penalty term. These concepts are formalized below.

Assume the problem of binary classification on a dataset $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$, i.e. $x_i$ is a data point represented as a d-dimensional vector, and $y_i \in \{-1, 1\}$ is the class label of that data point, for $i = 1, 2, ..., n$. A better separation can often be found by first transforming the data into a higher-dimensional feature space via a non-linear mapping function $\phi$ [2]; the associated kernel is the function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ induced by this mapping. A possible decision boundary can then be represented by $w \cdot \phi(x) + b = 0$, where $w$ is the weight vector orthogonal to the decision boundary and $b$ is an intercept term. It follows that, if the data set is linearly separable, the decision boundary that maximizes the margin can be found by solving the optimization $\min \frac{1}{2} w \cdot w$ subject to $y_i (w \cdot \phi(x_i) + b) \ge 1$ for $i = 1, 2, ..., n$. This encapsulates the concept of a ‘hard’ margin. However, in the case of non-linearly separable data, the above constraint has to be relaxed by introducing slack variables $\varepsilon_i$. The optimization problem then becomes $\min \left( \frac{1}{2} w \cdot w + C \sum_{i=1}^n \varepsilon_i \right)$ subject to $y_i (w \cdot \phi(x_i) + b) \ge 1 - \varepsilon_i$ and $\varepsilon_i \ge 0$ for $i = 1, 2, ..., n$. The $\sum_{i=1}^n \varepsilon_i$ term can be interpreted as the misclassification cost. This new objective function comprises two aims: the first is still to maximize the margin, and the second is to reduce the number of misclassifications. The trade-off between these two aims is controlled by the parameter $C$, coined the regularization parameter; this encapsulates the concept of a ‘soft’ margin. A high value of $C$ increases the penalty for misclassifications, placing more emphasis on the second aim; a large enough penalty forces the model to avoid misclassifications at the expense of the margin, so a high enough value of $C$ could induce over-fitting. A small value of $C$ decreases the penalty for misclassifications, placing more emphasis on the first aim; a small enough penalty allows the model to tolerate misclassifications more readily, so a small enough value of $C$ could induce under-fitting.
The SVM classifier is trained using the hinge-loss as the loss function [5].
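
Equivalently, eliminating the slack variables shows that the soft-margin objective can be written as $\min_{w, b} \left( \frac{1}{2} w \cdot w + C \sum_{i=1}^n \max(0, 1 - y_i (w \cdot \phi(x_i) + b)) \right)$, where the $\max(0, 1 - y_i (w \cdot \phi(x_i) + b))$ term is exactly the hinge loss. To make the role of $C$ concrete, the short sketch below fits a linear soft-margin SVM for several values of $C$ and prints training and hold-out accuracies. It is an illustration only, not part of the competition workflow; the synthetic data set and the particular values of $C$ are arbitrary choices for demonstration.

In [ ]:
#illustration only: effect of the regularization parameter C on a linear soft-margin SVM
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

#small synthetic, noisy (non-linearly-separable) data set
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=5,
                                     flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X_tr, y_tr)
    #a very large C tolerates few misclassifications (risk of over-fitting);
    #a very small C tolerates many (risk of under-fitting)
    print(C, clf.score(X_tr, y_tr), clf.score(X_te, y_te))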

1.2) Suitability of SVM

SVM is a popular technique because of its solid mathematical foundations, high generalisation capability, ability to find global solutions and ability to find non-linear solutions [6]. However, SVMs can be adversely affected by data sets with imbalanced classes. The methods for dealing with this will be discussed in Section 2.1 of this report. Thus, SVMs remain applicable to data sets with class imbalances, such as the data set provided here. It has been argued that SVMs show superior performance to other techniques when the analysis is conducted on high-dimensional data [7]. The data set here, even after pre-processing, has many dimensions, so the use of SVM in this context is justified. Another drawback of SVM is its dependency on feature scaling; the performance of an SVM can be highly affected by the choice of feature scaling method. Nevertheless, feature scaling is an important pre-processing technique. One encouraging reason for employing feature scaling is that the gradient descent algorithm converges much faster with scaled features than without. In particular, feature scaling reduces the time it takes for the SVM to find support vectors [8].
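
To illustrate the point about feature scaling, the sketch below standardizes each feature to zero mean and unit variance before fitting the SVM. The pipeline and synthetic data are assumptions for demonstration only; the scaling method actually used for the final model is described in the pre-processing section of the workflow.

In [ ]:
#illustration only: scale features before fitting the SVM
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
#each feature is standardized to zero mean and unit variance before the SVM sees it
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
scaled_svm.fit(X_demo, y_demo)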

2. Data Preparation Before Pre-Processing

This section will cover how the training data for the final model was prepared. Several additional pieces of information were provided in the assignment outline. This section will demonstrate how these strands of information were incorporated, if they were incorporated at all.


In [1]:
#Import Relevant Modules and Packages 
import pandas as pd
import numpy as np 
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from scipy import stats
from sklearn.feature_selection import VarianceThreshold
#see all rows of dataframe
#pd.set_option('display.max_rows', 500)

In [2]:
#Load the complete training data set 
training_data = pd.read_csv("/Users/Max/Desktop/Max's Folder/Uni Work/Data Science MSc/Machine Learning/ML Kaggle Competition /Data Sets/Training Data Set.csv", header=0, index_col=0)

In [3]:
#Observe the original training data 
training_data.head()


Out[3]:
CNNs CNNs.1 CNNs.2 CNNs.3 CNNs.4 CNNs.5 CNNs.6 CNNs.7 CNNs.8 CNNs.9 ... GIST.503 GIST.504 GIST.505 GIST.506 GIST.507 GIST.508 GIST.509 GIST.510 GIST.511 prediction
ID
1 0.00000 0.16784 1.477 0.75651 0.38741 0.0000 0.21295 0.0000 0.000000 0.225750 ... 0.025833 0.021306 0.027640 0.036184 0.047010 0.037981 0.049249 0.059802 0.035669 0
2 0.00000 0.00000 0.000 0.44260 0.00000 0.0000 0.15024 1.4806 0.635870 0.020341 ... 0.017774 0.020330 0.019916 0.033483 0.015937 0.021656 0.018347 0.017458 0.018744 0
3 0.00000 0.00000 0.000 0.47042 0.00000 1.2779 0.45954 0.0000 0.000000 0.000000 ... 0.017935 0.005156 0.041298 0.014921 0.015868 0.012122 0.015664 0.011410 0.017450 1
4 0.00000 0.00000 0.000 0.00000 0.00000 0.0000 0.00000 0.0000 0.030878 0.928510 ... 0.039596 0.007086 0.013696 0.028789 0.022858 0.030883 0.026539 0.021337 0.018109 1
5 0.49099 0.83388 0.000 0.00000 0.00000 0.0000 0.00000 0.0000 0.188490 0.764420 ... 0.008161 0.036306 0.029198 0.045733 0.008041 0.013111 0.022239 0.058815 0.014322 1

5 rows × 4609 columns


In [4]:
#quantify class counts of original training data 
training_data.prediction.value_counts()


Out[4]:
1    205
0    175
Name: prediction, dtype: int64

2.1) Dealing with Missing Values – Imputation

Imputation is the act of replacing missing values in a data set with meaningful values. Simply removing rows with missing feature values is bad practice when data is scarce, as a lot of information could be lost; in addition, deletion methods can introduce bias [9]. The incomplete additional training data was combined with the complete original training data because the original data was scarce in number. Since the additional training data contained missing values, imputation was required. Two methods of imputation were employed. The first was imputation via feature means. However, this method has been heavily criticized; in particular, it has been hypothesized that mean imputation introduces bias and underestimates variability [10]. The second was k-Nearest-Neighbours (kNN) imputation [11]. This is one of the hot-deck imputation techniques [12], where missing feature values are filled in from data points that are similar, or, geometrically speaking, closest in distance. Given the flaws of feature-mean imputation, this method is more appropriate, so kNN was the imputation method used to build the final model. The kNN implementation was taken from the ‘fancyimpute’ package [13]. The k of kNN can be considered a parameter that needs to be chosen carefully. Fortunately, the literature provides some direction on this: the work of [14] suggests that kNN with 3 nearest neighbours offers the best trade-off between imputation error and preservation of data structure. In summary, kNN was employed for imputation, with k set to 3.
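
The sketch below illustrates kNN imputation with k = 3 on a tiny toy matrix. It uses scikit-learn's KNNImputer purely so the example is self-contained; the final model used the KNN implementation from the fancyimpute package, shown later in this Notebook.

In [ ]:
#illustration only: kNN imputation with k = 3 on a toy matrix (scikit-learn's KNNImputer)
import numpy as np
from sklearn.impute import KNNImputer

toy = np.array([[1.0, 2.0, np.nan],
                [3.0, np.nan, 6.0],
                [2.0, 3.0, 5.0],
                [4.0, 5.0, 7.0]])

imputer = KNNImputer(n_neighbors=3)           #k = 3 nearest neighbours
toy_complete = imputer.fit_transform(toy)     #NaNs replaced by averages over the 3 nearest rows
print(toy_complete)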

This section will cover how the incomplete additional training data set was incorporated to develop a larger training data set. In particular, the additional training data was combined with the original training data. The additional training data was incomplete, with several NaN entries, so imputation was performed to replace the NaN entries with meaningful values.


In [5]:
#Load additional training data 
add_training_data = pd.read_csv("/Users/Max/Desktop/Max's Folder/Uni Work/Data Science MSc/Machine Learning/ML Kaggle Competition /Data Sets/Additional Training Data Set .csv", header=0, index_col=0)

In [6]:
#observe additional training data 
add_training_data


Out[6]:
CNNs CNNs.1 CNNs.2 CNNs.3 CNNs.4 CNNs.5 CNNs.6 CNNs.7 CNNs.8 CNNs.9 ... GIST.503 GIST.504 GIST.505 GIST.506 GIST.507 GIST.508 GIST.509 GIST.510 GIST.511 prediction
ID
381 0.36854 0.000000 NaN 0.000000 0.00000 NaN 0.360540 0.659070 0.907850 1.207200 ... 0.016446 0.014056 0.043309 NaN 0.037343 0.006719 NaN 0.024491 0.015025 1
382 0.00000 0.000000 0.000000 1.194500 1.10800 1.443800 0.000000 0.718870 NaN NaN ... 0.005913 NaN 0.006697 0.003553 0.004236 NaN 0.008944 0.020299 NaN 0
383 0.33315 0.000000 0.000000 0.000000 0.00000 0.000000 0.103570 0.094568 NaN 0.000000 ... 0.028948 NaN NaN 0.032514 0.031939 0.048637 0.047683 0.051014 0.023250 1
384 0.00000 0.000000 0.000000 0.000000 0.00000 0.425680 0.674630 0.000000 0.797180 0.441840 ... 0.048603 0.012979 0.039932 0.024701 0.027439 0.014231 0.044304 0.057307 0.025580 1
385 0.00000 0.000000 NaN 0.000000 NaN 0.000000 0.000000 0.536480 0.000000 1.662200 ... 0.063778 0.014582 0.094946 0.072355 0.036569 0.019885 0.075454 0.057808 NaN 1
386 0.00000 0.000000 0.000000 0.000000 NaN 0.033839 NaN 0.550010 0.000000 0.000000 ... 0.002547 0.000970 0.004847 NaN NaN 0.001280 0.010284 0.017465 0.003835 0
387 0.89589 2.249300 0.000000 0.096874 0.00000 0.000000 NaN 0.000000 0.000000 0.000000 ... 0.021326 0.048036 NaN 0.062308 NaN 0.029789 0.023238 0.049789 0.042957 1
388 0.00000 NaN 0.000000 NaN 0.15435 0.000000 0.345190 0.790570 0.000000 0.603140 ... 0.020851 0.054987 0.048095 0.054826 0.044836 0.047492 0.031173 NaN 0.056663 1
389 2.07390 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 NaN 0.000000 NaN ... 0.069224 NaN 0.031577 0.050929 0.039233 0.012247 0.035869 0.065476 0.036050 1
390 0.00000 2.013900 NaN 0.000000 0.00000 0.000000 0.551610 0.000000 0.770090 0.000000 ... 0.010342 0.069422 0.073940 0.042869 0.014791 0.046599 NaN NaN 0.012851 1
391 NaN NaN 0.000000 0.000000 NaN 0.000000 0.000000 0.449090 0.000000 0.000000 ... 0.017124 NaN 0.037151 0.066304 0.030059 0.017258 0.057827 0.046734 0.013858 0
392 0.00000 NaN 0.551660 0.511390 0.44663 0.000000 0.829450 0.169020 0.000000 0.656110 ... NaN NaN 0.002628 0.001312 0.003053 0.042552 0.016416 NaN 0.009700 1
393 0.00000 1.758300 0.000000 0.000000 0.70794 0.000000 1.016600 NaN 0.223690 0.000000 ... 0.002593 NaN 0.041875 NaN 0.002247 NaN NaN NaN 0.004210 1
394 0.00000 0.000000 0.000000 NaN 1.69510 NaN 0.000000 0.000000 NaN 0.387880 ... 0.001936 0.009446 0.057476 0.030121 0.013046 NaN 0.039464 NaN 0.029116 1
395 0.00000 0.000000 0.065205 NaN 0.00000 0.000000 0.165250 0.000000 NaN NaN ... 0.044815 0.025457 0.033375 0.057896 0.033652 NaN 0.020977 0.025914 0.018434 1
396 0.16352 0.091154 NaN 0.127380 NaN NaN 0.000000 0.000000 1.230700 0.600490 ... 0.027672 0.006230 0.010304 0.005203 NaN 0.004434 NaN 0.004102 0.015907 0
397 0.00000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.608360 1.405400 NaN ... 0.004091 NaN 0.021434 0.004124 0.006042 0.040793 0.042516 NaN 0.029567 1
398 0.00000 0.000000 0.000000 0.000000 0.00000 0.622480 0.000000 0.537490 1.016600 0.003190 ... 0.027232 NaN 0.042924 NaN 0.035156 0.038203 0.037265 0.023543 0.017802 1
399 0.00000 0.000000 0.000000 0.000000 0.00000 0.370100 0.132860 0.455330 0.182380 1.002500 ... 0.016653 0.044491 0.024781 0.041926 NaN NaN NaN 0.043103 0.052243 1
400 0.27942 NaN 0.000000 1.303200 0.00000 0.445950 0.000000 0.000000 0.664470 NaN ... 0.005121 0.004739 0.015804 0.009327 0.006316 NaN 0.015994 0.023798 NaN 1
401 0.00000 1.153500 0.000000 0.000000 NaN 0.607270 0.133640 0.000000 0.000000 0.000000 ... 0.021836 0.027867 0.044285 0.056987 0.024753 0.020699 0.016404 0.021481 NaN 1
402 0.00000 0.000000 0.000000 NaN 1.09950 0.000000 0.221540 NaN NaN 0.576960 ... 0.055716 0.021611 NaN 0.006325 0.024249 0.018989 0.010881 0.029069 0.035425 0
403 0.69578 1.076700 0.000000 0.000000 0.00000 0.422790 0.074301 NaN 0.000000 0.000000 ... 0.022869 0.041593 NaN NaN NaN 0.028789 NaN 0.034176 0.028748 1
404 0.00000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 1.911900 ... NaN 0.024381 0.026518 0.014335 0.007303 NaN NaN 0.029905 NaN 1
405 0.00000 0.000000 0.000000 0.095563 1.35120 0.000000 0.073423 0.530710 0.150650 0.000000 ... 0.001538 0.007162 0.009936 NaN 0.019254 0.010202 0.010988 0.007354 0.002542 0
406 NaN 0.456580 0.000000 NaN 0.00000 NaN 0.000000 0.088917 0.000000 0.574290 ... 0.028478 0.001726 0.001601 0.030189 0.028521 0.002605 0.003204 0.019569 0.018815 1
407 NaN 0.759570 0.602440 0.000000 NaN 1.053600 NaN 0.000000 0.137830 0.778610 ... 0.022562 0.005456 NaN 0.012345 0.017503 0.026339 0.041269 0.020342 0.015812 0
408 0.00000 0.000000 0.000000 NaN NaN 0.591770 0.171560 0.312710 NaN 0.000000 ... 0.024517 0.012467 NaN 0.026758 0.030351 0.025761 0.037172 0.019815 0.023609 0
409 NaN 0.000000 NaN 0.698550 0.00000 0.000000 0.197920 NaN 0.000000 0.000000 ... 0.056415 NaN NaN 0.030497 0.025438 0.047698 0.048943 0.050613 0.022563 0
410 0.00000 1.107400 NaN NaN 2.14430 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.010856 0.032030 0.047955 0.029015 0.014196 0.032521 0.035577 0.011103 0.010629 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3771 0.00000 0.000000 0.000000 0.000000 0.00000 0.613830 1.456800 1.350900 0.769510 0.000000 ... 0.042000 0.031909 0.012088 0.019042 0.013035 0.069337 0.040768 NaN 0.033300 1
3772 0.48447 0.773390 0.000000 0.213940 0.00000 0.000000 0.000000 0.308140 0.120880 NaN ... 0.020615 0.010171 0.024787 0.021214 0.015437 0.021795 0.032656 0.011645 0.012731 1
3773 1.52270 0.000000 0.000000 NaN 0.00000 NaN 0.000000 0.000000 0.000000 NaN ... 0.018538 0.010537 0.027684 0.031280 0.027294 0.013832 NaN 0.033723 0.039819 1
3774 NaN NaN 0.000000 0.000000 NaN NaN 0.000000 0.000000 NaN 0.000000 ... 0.037988 NaN 0.043092 0.031044 0.036253 0.059886 0.047193 0.065673 0.021502 1
3775 NaN NaN 0.000000 0.435930 0.00000 0.000000 0.000000 0.000000 NaN NaN ... 0.032696 0.034843 NaN 0.035841 0.018307 NaN 0.038746 0.038803 NaN 0
3776 0.52334 0.000000 0.000000 NaN 0.00000 0.000000 0.000000 0.174440 0.000000 0.650650 ... 0.042804 0.010049 0.012368 0.032397 0.040944 0.016676 NaN 0.023827 NaN 1
3777 0.00000 0.000000 0.000000 0.000000 0.00000 NaN 0.000000 0.000000 NaN 1.662700 ... 0.015793 0.022092 0.014955 0.018942 0.016944 0.032190 0.028367 0.018088 NaN 1
3778 NaN 0.000000 0.258000 0.507590 0.00000 0.000000 0.009438 0.000000 0.000000 0.143250 ... 0.017947 0.024456 NaN 0.041183 0.047993 0.033153 0.040062 0.023869 NaN 0
3779 0.00000 0.000000 NaN 0.000000 0.16713 0.000000 NaN 0.000000 0.000000 0.000000 ... 0.033826 0.056108 0.062568 NaN 0.028362 0.038791 0.040587 0.035817 0.017098 0
3780 0.00000 0.000000 NaN 0.000000 NaN 1.302700 0.000000 0.000000 NaN 0.000000 ... NaN NaN 0.053071 NaN 0.023068 0.002995 0.007839 NaN 0.014431 1
3781 0.00000 NaN 0.000000 0.283930 1.20350 0.017472 NaN 0.000000 NaN 0.364180 ... 0.018298 0.017468 NaN NaN 0.013216 0.018009 NaN 0.013381 0.009624 1
3782 0.08295 0.153360 0.000000 0.000000 NaN 0.000000 0.973910 NaN 0.043460 1.534700 ... 0.023122 0.011625 NaN 0.008095 0.011113 0.041680 0.019421 0.020782 0.012575 1
3783 0.00000 0.000000 0.000000 0.000000 0.00000 NaN NaN 0.000000 0.461830 1.176400 ... NaN 0.027828 0.038889 0.021408 0.003900 0.029600 0.029911 0.026131 0.011419 0
3784 0.00000 0.738330 NaN 0.000000 0.00000 NaN NaN 0.000000 0.000000 0.916750 ... 0.037651 0.049141 0.021031 0.028404 0.012977 0.039746 0.040718 0.013986 NaN 1
3785 NaN 0.190430 0.363980 NaN 0.00000 0.000000 NaN 0.383590 0.272360 0.000000 ... 0.036021 0.035155 0.031911 0.057594 0.080866 0.034773 0.047184 0.064976 NaN 1
3786 NaN NaN NaN 0.324010 0.00000 0.000000 0.000000 0.000000 NaN NaN ... NaN 0.005199 NaN 0.006255 0.008951 NaN NaN 0.010050 0.008300 1
3787 0.00000 0.000000 0.501170 0.057088 0.16317 0.646400 0.298280 0.757990 0.087317 0.146870 ... 0.021765 0.016235 0.044767 0.024993 0.010303 0.009179 0.031484 0.070180 0.046984 1
3788 0.00000 0.000000 NaN 0.000000 0.00000 NaN 0.825670 0.000000 NaN NaN ... 0.006477 0.011424 NaN 0.004331 NaN 0.008933 NaN 0.017482 0.031523 0
3789 0.00000 0.000000 0.000000 NaN 0.00000 1.338000 0.000000 0.000000 0.000000 0.113400 ... 0.006814 0.000684 0.001524 NaN 0.001442 0.002318 NaN 0.011766 0.007097 1
3790 0.00000 0.000000 0.000000 0.476500 NaN 0.176770 0.085610 0.752900 0.000000 0.350030 ... NaN NaN 0.021348 0.031300 NaN 0.014002 0.030157 0.037464 0.024539 0
3791 0.00000 0.317140 0.246310 0.000000 0.00000 NaN 0.000000 NaN 0.000000 NaN ... 0.020298 0.006256 0.015469 0.038902 NaN 0.020679 0.027800 0.035840 NaN 1
3792 0.00000 0.000000 0.000000 0.000000 0.00000 0.000000 NaN 0.754110 0.151220 0.133960 ... NaN NaN 0.056608 0.034379 NaN 0.050165 NaN 0.038335 NaN 1
3793 0.00000 NaN 1.137600 0.000000 0.00000 0.000000 0.084377 0.000000 0.000000 0.947300 ... NaN 0.037247 0.041796 0.026112 0.010530 0.046502 NaN 0.042460 NaN 0
3794 0.00000 0.537230 0.000000 0.000000 0.00000 1.846200 0.000000 0.000000 0.000000 1.628400 ... 0.030879 0.037300 NaN 0.017458 0.017493 0.027442 0.014037 0.011320 0.010543 0
3795 0.00000 0.000000 0.000000 0.262230 0.00000 0.000000 0.000000 NaN 0.000000 0.000000 ... 0.034215 0.043260 NaN 0.042648 0.044968 0.038014 NaN 0.054445 0.032456 0
3796 0.00000 0.944010 0.000000 0.000000 NaN 0.000000 0.000000 0.000000 0.000000 0.000000 ... NaN 0.021328 0.016077 0.019606 NaN 0.005605 0.003127 0.009222 0.019916 1
3797 NaN 0.000000 NaN NaN NaN 0.516080 0.000000 0.133720 0.359210 NaN ... 0.019692 0.013608 0.020126 0.021958 0.035866 0.025194 0.029437 NaN NaN 1
3798 NaN 0.000000 0.146570 0.000000 0.26073 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.022441 0.025916 0.040383 0.045961 0.012540 0.025097 NaN 0.030621 NaN 1
3799 0.00000 NaN 0.293200 0.000000 NaN 0.000000 0.262210 0.000000 0.000000 0.000000 ... 0.012463 0.024990 0.034452 0.014815 0.008251 0.058643 NaN 0.038955 0.010777 1
3800 0.00000 0.000000 0.000000 0.000000 0.00000 NaN 0.490140 NaN 1.968300 0.008118 ... 0.055913 0.009899 NaN 0.017555 0.019566 NaN 0.048685 NaN 0.028767 1

3420 rows × 4609 columns


In [7]:
#quantify class counts of additional training data 
add_training_data.prediction.value_counts()


Out[7]:
1    1995
0    1425
Name: prediction, dtype: int64

In [8]:
#find number of NAs for each column for additional training data
add_training_data.isnull().sum()


Out[8]:
CNNs          698
CNNs.1        653
CNNs.2        719
CNNs.3        648
CNNs.4        674
CNNs.5        693
CNNs.6        688
CNNs.7        710
CNNs.8        610
CNNs.9        680
CNNs.10       670
CNNs.11       693
CNNs.12       682
CNNs.13       681
CNNs.14       657
CNNs.15       704
CNNs.16       689
CNNs.17       668
CNNs.18       686
CNNs.19       706
CNNs.20       697
CNNs.21       672
CNNs.22       723
CNNs.23       654
CNNs.24       699
CNNs.25       672
CNNs.26       689
CNNs.27       648
CNNs.28       693
CNNs.29       712
             ... 
GIST.483      638
GIST.484      629
GIST.485      666
GIST.486      648
GIST.487      723
GIST.488      686
GIST.489      689
GIST.490      664
GIST.491      690
GIST.492      676
GIST.493      689
GIST.494      668
GIST.495      673
GIST.496      755
GIST.497      688
GIST.498      674
GIST.499      674
GIST.500      728
GIST.501      702
GIST.502      667
GIST.503      682
GIST.504      717
GIST.505      678
GIST.506      692
GIST.507      660
GIST.508      686
GIST.509      681
GIST.510      699
GIST.511      680
prediction      0
dtype: int64

In [9]:
#concatenate original training data with additional training data 
full_training_data_inc = pd.concat([training_data, add_training_data])
#observe concatenated training data 
full_training_data_inc


Out[9]:
CNNs CNNs.1 CNNs.2 CNNs.3 CNNs.4 CNNs.5 CNNs.6 CNNs.7 CNNs.8 CNNs.9 ... GIST.503 GIST.504 GIST.505 GIST.506 GIST.507 GIST.508 GIST.509 GIST.510 GIST.511 prediction
ID
1 0.000000 0.167840 1.477000 0.756510 0.387410 0.000000 0.212950 0.00000 0.000000 0.225750 ... 0.025833 0.021306 0.027640 0.036184 0.047010 0.037981 0.049249 0.059802 0.035669 0
2 0.000000 0.000000 0.000000 0.442600 0.000000 0.000000 0.150240 1.48060 0.635870 0.020341 ... 0.017774 0.020330 0.019916 0.033483 0.015937 0.021656 0.018347 0.017458 0.018744 0
3 0.000000 0.000000 0.000000 0.470420 0.000000 1.277900 0.459540 0.00000 0.000000 0.000000 ... 0.017935 0.005156 0.041298 0.014921 0.015868 0.012122 0.015664 0.011410 0.017450 1
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.030878 0.928510 ... 0.039596 0.007086 0.013696 0.028789 0.022858 0.030883 0.026539 0.021337 0.018109 1
5 0.490990 0.833880 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.188490 0.764420 ... 0.008161 0.036306 0.029198 0.045733 0.008041 0.013111 0.022239 0.058815 0.014322 1
6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.025636 0.008809 0.026506 0.018506 0.029058 0.009211 0.013236 0.031606 0.022141 1
7 0.368230 0.000000 0.000000 0.000000 0.000000 0.395810 0.948560 0.00000 0.000000 0.000000 ... 0.031240 0.016638 0.040408 0.028362 0.016704 0.034409 0.025067 0.024614 0.026773 1
8 0.367450 0.000000 0.087409 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.588080 ... 0.052771 0.017765 0.055323 0.067212 0.048452 0.019376 0.056357 0.056325 0.050188 1
9 0.066494 0.000000 0.000000 0.084850 0.608320 0.522730 0.000000 0.37833 0.000000 0.096584 ... 0.020576 0.026661 0.021242 0.019962 0.040603 0.027398 0.019766 0.020432 0.032214 0
10 0.495670 2.536900 0.000000 0.000000 0.000000 1.530300 0.000000 0.00000 0.000000 0.000000 ... 0.011404 0.013138 0.025195 0.017418 0.010645 0.012981 0.039255 0.016495 0.007007 1
11 0.000000 0.000000 0.000000 0.000000 0.623180 0.524910 0.000000 1.34940 0.000000 0.000000 ... 0.024555 0.005029 0.031665 0.040577 0.026261 0.023069 0.043602 0.044524 0.066983 0
12 1.096500 0.720820 0.418210 0.000000 0.312950 0.000000 0.000000 0.00000 0.000000 0.464130 ... 0.013453 0.034157 0.045233 0.044563 0.017900 0.043618 0.076412 0.036831 0.007185 1
13 0.653430 0.000000 0.142020 0.046679 0.000000 0.000000 0.650850 0.00000 0.000000 0.000000 ... 0.047733 0.043096 0.055382 0.050194 0.039210 0.023657 0.021919 0.055182 0.027263 0
14 0.000000 0.000000 0.000000 0.000000 1.299300 0.882630 0.137290 0.00000 0.000000 0.790330 ... 0.002243 0.018213 0.030337 0.011113 0.003206 0.036056 0.024078 0.020279 0.022261 0
15 0.000000 1.103800 0.000000 0.000000 1.145400 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.041530 0.036399 0.031912 0.029357 0.056295 0.012967 0.021085 0.042250 0.037959 0
16 0.658060 1.518300 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 2.115800 ... 0.039075 0.024941 0.018340 0.025604 0.014863 0.038411 0.051662 0.075225 0.031801 1
17 0.915600 0.000000 0.000000 0.000000 0.000000 0.000000 0.827120 0.00000 0.000000 0.000000 ... 0.030439 0.048584 0.034337 0.049318 0.021250 0.035184 0.032242 0.025961 0.023896 0
18 0.122340 0.484180 0.575630 0.000000 0.056843 0.000000 0.269810 0.80379 0.000000 0.847900 ... 0.043903 0.025202 0.036941 0.101960 0.042061 0.033733 0.056107 0.043020 0.034273 1
19 0.000000 0.000000 0.887790 0.796050 0.949070 0.000000 0.000000 0.00000 0.000000 0.150800 ... 0.062996 0.046529 0.029277 0.048688 0.030056 0.066896 0.064681 0.064771 0.033705 0
20 0.000000 0.911010 0.000000 0.000000 0.336420 0.000000 0.000000 0.18094 0.263610 0.988440 ... 0.008856 0.041469 0.009145 0.009094 0.009796 0.027103 0.042893 0.056196 0.012501 1
21 0.000000 0.000000 0.000000 0.198800 0.000000 0.901150 0.000000 0.00000 0.413540 0.000000 ... 0.020464 0.037698 0.028640 0.023290 0.030678 0.032787 0.049547 0.027087 0.044812 0
22 0.457150 1.565800 0.000000 0.000000 0.000000 0.719880 0.000000 0.00000 0.107970 0.000000 ... 0.036841 0.060575 0.041959 0.047929 0.039467 0.049774 0.082377 0.057957 0.035604 1
23 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.019534 0.006583 0.012513 0.028868 0.039953 0.014977 0.045486 0.021237 0.026507 0
24 1.145900 1.242200 0.000000 0.000000 0.000000 0.956940 0.000000 0.00000 0.000000 0.519750 ... 0.039259 0.018149 0.012153 0.018000 0.037783 0.019433 0.014982 0.025554 0.025585 1
25 0.384660 0.642160 0.000000 0.000000 0.000000 0.000000 0.141730 0.51014 0.000000 0.322540 ... 0.034577 0.031162 0.038961 0.041850 0.026021 0.007156 0.031507 0.048640 0.028067 1
26 0.044793 0.171220 0.150470 0.791640 0.100370 0.000000 0.000000 0.00000 0.000000 0.546100 ... 0.059782 0.027918 0.022951 0.067286 0.072825 0.026592 0.030550 0.047546 0.061572 0
27 0.024028 0.000000 0.000000 1.560600 0.323740 0.573730 0.000000 0.97208 0.402180 0.613800 ... 0.019090 0.011114 0.007980 0.045997 0.045494 0.018651 0.011630 0.011288 0.019492 0
28 0.311980 0.244520 0.212100 0.978550 0.000000 1.319800 0.000000 0.00000 0.000000 0.026332 ... 0.033687 0.066910 0.036916 0.029357 0.017351 0.020543 0.015300 0.016477 0.019715 0
29 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.55810 0.485880 0.815440 ... 0.039693 0.020571 0.044319 0.039485 0.047731 0.043560 0.043651 0.042374 0.048968 0
30 0.269790 0.024128 0.000156 0.000000 0.174980 0.000000 0.337810 0.00000 0.000000 0.000000 ... 0.011549 0.085321 0.016958 0.008131 0.022019 0.031845 0.020188 0.007039 0.012079 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3771 0.000000 0.000000 0.000000 0.000000 0.000000 0.613830 1.456800 1.35090 0.769510 0.000000 ... 0.042000 0.031909 0.012088 0.019042 0.013035 0.069337 0.040768 NaN 0.033300 1
3772 0.484470 0.773390 0.000000 0.213940 0.000000 0.000000 0.000000 0.30814 0.120880 NaN ... 0.020615 0.010171 0.024787 0.021214 0.015437 0.021795 0.032656 0.011645 0.012731 1
3773 1.522700 0.000000 0.000000 NaN 0.000000 NaN 0.000000 0.00000 0.000000 NaN ... 0.018538 0.010537 0.027684 0.031280 0.027294 0.013832 NaN 0.033723 0.039819 1
3774 NaN NaN 0.000000 0.000000 NaN NaN 0.000000 0.00000 NaN 0.000000 ... 0.037988 NaN 0.043092 0.031044 0.036253 0.059886 0.047193 0.065673 0.021502 1
3775 NaN NaN 0.000000 0.435930 0.000000 0.000000 0.000000 0.00000 NaN NaN ... 0.032696 0.034843 NaN 0.035841 0.018307 NaN 0.038746 0.038803 NaN 0
3776 0.523340 0.000000 0.000000 NaN 0.000000 0.000000 0.000000 0.17444 0.000000 0.650650 ... 0.042804 0.010049 0.012368 0.032397 0.040944 0.016676 NaN 0.023827 NaN 1
3777 0.000000 0.000000 0.000000 0.000000 0.000000 NaN 0.000000 0.00000 NaN 1.662700 ... 0.015793 0.022092 0.014955 0.018942 0.016944 0.032190 0.028367 0.018088 NaN 1
3778 NaN 0.000000 0.258000 0.507590 0.000000 0.000000 0.009438 0.00000 0.000000 0.143250 ... 0.017947 0.024456 NaN 0.041183 0.047993 0.033153 0.040062 0.023869 NaN 0
3779 0.000000 0.000000 NaN 0.000000 0.167130 0.000000 NaN 0.00000 0.000000 0.000000 ... 0.033826 0.056108 0.062568 NaN 0.028362 0.038791 0.040587 0.035817 0.017098 0
3780 0.000000 0.000000 NaN 0.000000 NaN 1.302700 0.000000 0.00000 NaN 0.000000 ... NaN NaN 0.053071 NaN 0.023068 0.002995 0.007839 NaN 0.014431 1
3781 0.000000 NaN 0.000000 0.283930 1.203500 0.017472 NaN 0.00000 NaN 0.364180 ... 0.018298 0.017468 NaN NaN 0.013216 0.018009 NaN 0.013381 0.009624 1
3782 0.082950 0.153360 0.000000 0.000000 NaN 0.000000 0.973910 NaN 0.043460 1.534700 ... 0.023122 0.011625 NaN 0.008095 0.011113 0.041680 0.019421 0.020782 0.012575 1
3783 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN 0.00000 0.461830 1.176400 ... NaN 0.027828 0.038889 0.021408 0.003900 0.029600 0.029911 0.026131 0.011419 0
3784 0.000000 0.738330 NaN 0.000000 0.000000 NaN NaN 0.00000 0.000000 0.916750 ... 0.037651 0.049141 0.021031 0.028404 0.012977 0.039746 0.040718 0.013986 NaN 1
3785 NaN 0.190430 0.363980 NaN 0.000000 0.000000 NaN 0.38359 0.272360 0.000000 ... 0.036021 0.035155 0.031911 0.057594 0.080866 0.034773 0.047184 0.064976 NaN 1
3786 NaN NaN NaN 0.324010 0.000000 0.000000 0.000000 0.00000 NaN NaN ... NaN 0.005199 NaN 0.006255 0.008951 NaN NaN 0.010050 0.008300 1
3787 0.000000 0.000000 0.501170 0.057088 0.163170 0.646400 0.298280 0.75799 0.087317 0.146870 ... 0.021765 0.016235 0.044767 0.024993 0.010303 0.009179 0.031484 0.070180 0.046984 1
3788 0.000000 0.000000 NaN 0.000000 0.000000 NaN 0.825670 0.00000 NaN NaN ... 0.006477 0.011424 NaN 0.004331 NaN 0.008933 NaN 0.017482 0.031523 0
3789 0.000000 0.000000 0.000000 NaN 0.000000 1.338000 0.000000 0.00000 0.000000 0.113400 ... 0.006814 0.000684 0.001524 NaN 0.001442 0.002318 NaN 0.011766 0.007097 1
3790 0.000000 0.000000 0.000000 0.476500 NaN 0.176770 0.085610 0.75290 0.000000 0.350030 ... NaN NaN 0.021348 0.031300 NaN 0.014002 0.030157 0.037464 0.024539 0
3791 0.000000 0.317140 0.246310 0.000000 0.000000 NaN 0.000000 NaN 0.000000 NaN ... 0.020298 0.006256 0.015469 0.038902 NaN 0.020679 0.027800 0.035840 NaN 1
3792 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN 0.75411 0.151220 0.133960 ... NaN NaN 0.056608 0.034379 NaN 0.050165 NaN 0.038335 NaN 1
3793 0.000000 NaN 1.137600 0.000000 0.000000 0.000000 0.084377 0.00000 0.000000 0.947300 ... NaN 0.037247 0.041796 0.026112 0.010530 0.046502 NaN 0.042460 NaN 0
3794 0.000000 0.537230 0.000000 0.000000 0.000000 1.846200 0.000000 0.00000 0.000000 1.628400 ... 0.030879 0.037300 NaN 0.017458 0.017493 0.027442 0.014037 0.011320 0.010543 0
3795 0.000000 0.000000 0.000000 0.262230 0.000000 0.000000 0.000000 NaN 0.000000 0.000000 ... 0.034215 0.043260 NaN 0.042648 0.044968 0.038014 NaN 0.054445 0.032456 0
3796 0.000000 0.944010 0.000000 0.000000 NaN 0.000000 0.000000 0.00000 0.000000 0.000000 ... NaN 0.021328 0.016077 0.019606 NaN 0.005605 0.003127 0.009222 0.019916 1
3797 NaN 0.000000 NaN NaN NaN 0.516080 0.000000 0.13372 0.359210 NaN ... 0.019692 0.013608 0.020126 0.021958 0.035866 0.025194 0.029437 NaN NaN 1
3798 NaN 0.000000 0.146570 0.000000 0.260730 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.022441 0.025916 0.040383 0.045961 0.012540 0.025097 NaN 0.030621 NaN 1
3799 0.000000 NaN 0.293200 0.000000 NaN 0.000000 0.262210 0.00000 0.000000 0.000000 ... 0.012463 0.024990 0.034452 0.014815 0.008251 0.058643 NaN 0.038955 0.010777 1
3800 0.000000 0.000000 0.000000 0.000000 0.000000 NaN 0.490140 NaN 1.968300 0.008118 ... 0.055913 0.009899 NaN 0.017555 0.019566 NaN 0.048685 NaN 0.028767 1

3800 rows × 4609 columns

A couple of imputation methods were tried in the original Notebook:

  1. Imputation using the column (feature) mean
  2. Imputation with K-Nearest Neighbours

The most effective and theoretically best-supported method, producing the best results, was the second: imputation via K-Nearest Neighbours. Note: the fancyimpute package may need to be installed before running. K was set to 3 here; see Section 2.1 for justification.
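
For comparison, the first (rejected) method is essentially a one-liner in pandas; the sketch below is illustrative only and was not used for the final model.

In [ ]:
#illustration only: imputation via the column (feature) mean - not used for the final model
mean_imputed = full_training_data_inc.fillna(full_training_data_inc.mean())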


In [10]:
#imputation via KNN
from fancyimpute import KNN
knn_trial = full_training_data_inc
knn_trial

complete_knn = KNN(k=3).complete(knn_trial)


Imputing row 1/3800 with 0 missing, elapsed time: 718.091
Imputing row 101/3800 with 0 missing, elapsed time: 718.100
Imputing row 201/3800 with 0 missing, elapsed time: 718.105
Imputing row 301/3800 with 0 missing, elapsed time: 718.110
Imputing row 401/3800 with 863 missing, elapsed time: 719.570
Imputing row 501/3800 with 915 missing, elapsed time: 724.257
Imputing row 601/3800 with 880 missing, elapsed time: 727.962
Imputing row 701/3800 with 887 missing, elapsed time: 731.787
Imputing row 801/3800 with 900 missing, elapsed time: 735.372
Imputing row 901/3800 with 920 missing, elapsed time: 738.906
Imputing row 1001/3800 with 948 missing, elapsed time: 742.838
Imputing row 1101/3800 with 940 missing, elapsed time: 747.011
Imputing row 1201/3800 with 942 missing, elapsed time: 750.557
Imputing row 1301/3800 with 894 missing, elapsed time: 754.097
Imputing row 1401/3800 with 911 missing, elapsed time: 757.717
Imputing row 1501/3800 with 980 missing, elapsed time: 761.267
Imputing row 1601/3800 with 958 missing, elapsed time: 764.812
Imputing row 1701/3800 with 971 missing, elapsed time: 768.598
Imputing row 1801/3800 with 909 missing, elapsed time: 772.426
Imputing row 1901/3800 with 899 missing, elapsed time: 776.196
Imputing row 2001/3800 with 902 missing, elapsed time: 779.857
Imputing row 2101/3800 with 911 missing, elapsed time: 783.604
Imputing row 2201/3800 with 922 missing, elapsed time: 788.482
Imputing row 2301/3800 with 908 missing, elapsed time: 792.492
Imputing row 2401/3800 with 903 missing, elapsed time: 796.774
Imputing row 2501/3800 with 868 missing, elapsed time: 800.573
Imputing row 2601/3800 with 939 missing, elapsed time: 804.432
Imputing row 2701/3800 with 937 missing, elapsed time: 807.967
Imputing row 2801/3800 with 942 missing, elapsed time: 811.614
Imputing row 2901/3800 with 934 missing, elapsed time: 815.260
Imputing row 3001/3800 with 915 missing, elapsed time: 819.455
Imputing row 3101/3800 with 918 missing, elapsed time: 823.486
Imputing row 3201/3800 with 915 missing, elapsed time: 827.943
Imputing row 3301/3800 with 946 missing, elapsed time: 832.058
Imputing row 3401/3800 with 914 missing, elapsed time: 835.635
Imputing row 3501/3800 with 926 missing, elapsed time: 839.312
Imputing row 3601/3800 with 903 missing, elapsed time: 842.923
Imputing row 3701/3800 with 958 missing, elapsed time: 846.536

In [11]:
#convert imputed matrix back to dataframe for visualisation and convert 'prediction' dtype to int
complete_knn_df = pd.DataFrame(complete_knn, index=full_training_data_inc.index, columns=full_training_data_inc.columns)
full_training_data = complete_knn_df
full_training_data.prediction = full_training_data.prediction.astype('int')
full_training_data


Out[11]:
CNNs CNNs.1 CNNs.2 CNNs.3 CNNs.4 CNNs.5 CNNs.6 CNNs.7 CNNs.8 CNNs.9 ... GIST.503 GIST.504 GIST.505 GIST.506 GIST.507 GIST.508 GIST.509 GIST.510 GIST.511 prediction
ID
1 0.000000 0.167840 1.477000 0.756510 0.387410 0.000000 0.212950 0.000000 0.000000 0.225750 ... 0.025833 0.021306 0.027640 0.036184 0.047010 0.037981 0.049249 0.059802 0.035669 0
2 0.000000 0.000000 0.000000 0.442600 0.000000 0.000000 0.150240 1.480600 0.635870 0.020341 ... 0.017774 0.020330 0.019916 0.033483 0.015937 0.021656 0.018347 0.017458 0.018744 0
3 0.000000 0.000000 0.000000 0.470420 0.000000 1.277900 0.459540 0.000000 0.000000 0.000000 ... 0.017935 0.005156 0.041298 0.014921 0.015868 0.012122 0.015664 0.011410 0.017450 1
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.030878 0.928510 ... 0.039596 0.007086 0.013696 0.028789 0.022858 0.030883 0.026539 0.021337 0.018109 1
5 0.490990 0.833880 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.188490 0.764420 ... 0.008161 0.036306 0.029198 0.045733 0.008041 0.013111 0.022239 0.058815 0.014322 1
6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.025636 0.008809 0.026506 0.018506 0.029058 0.009211 0.013236 0.031606 0.022141 1
7 0.368230 0.000000 0.000000 0.000000 0.000000 0.395810 0.948560 0.000000 0.000000 0.000000 ... 0.031240 0.016638 0.040408 0.028362 0.016704 0.034409 0.025067 0.024614 0.026773 1
8 0.367450 0.000000 0.087409 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.588080 ... 0.052771 0.017765 0.055323 0.067212 0.048452 0.019376 0.056357 0.056325 0.050188 1
9 0.066494 0.000000 0.000000 0.084850 0.608320 0.522730 0.000000 0.378330 0.000000 0.096584 ... 0.020576 0.026661 0.021242 0.019962 0.040603 0.027398 0.019766 0.020432 0.032214 0
10 0.495670 2.536900 0.000000 0.000000 0.000000 1.530300 0.000000 0.000000 0.000000 0.000000 ... 0.011404 0.013138 0.025195 0.017418 0.010645 0.012981 0.039255 0.016495 0.007007 1
11 0.000000 0.000000 0.000000 0.000000 0.623180 0.524910 0.000000 1.349400 0.000000 0.000000 ... 0.024555 0.005029 0.031665 0.040577 0.026261 0.023069 0.043602 0.044524 0.066983 0
12 1.096500 0.720820 0.418210 0.000000 0.312950 0.000000 0.000000 0.000000 0.000000 0.464130 ... 0.013453 0.034157 0.045233 0.044563 0.017900 0.043618 0.076412 0.036831 0.007185 1
13 0.653430 0.000000 0.142020 0.046679 0.000000 0.000000 0.650850 0.000000 0.000000 0.000000 ... 0.047733 0.043096 0.055382 0.050194 0.039210 0.023657 0.021919 0.055182 0.027263 0
14 0.000000 0.000000 0.000000 0.000000 1.299300 0.882630 0.137290 0.000000 0.000000 0.790330 ... 0.002243 0.018213 0.030337 0.011113 0.003206 0.036056 0.024078 0.020279 0.022261 0
15 0.000000 1.103800 0.000000 0.000000 1.145400 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.041530 0.036399 0.031912 0.029357 0.056295 0.012967 0.021085 0.042250 0.037959 0
16 0.658060 1.518300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.115800 ... 0.039075 0.024941 0.018340 0.025604 0.014863 0.038411 0.051662 0.075225 0.031801 1
17 0.915600 0.000000 0.000000 0.000000 0.000000 0.000000 0.827120 0.000000 0.000000 0.000000 ... 0.030439 0.048584 0.034337 0.049318 0.021250 0.035184 0.032242 0.025961 0.023896 0
18 0.122340 0.484180 0.575630 0.000000 0.056843 0.000000 0.269810 0.803790 0.000000 0.847900 ... 0.043903 0.025202 0.036941 0.101960 0.042061 0.033733 0.056107 0.043020 0.034273 1
19 0.000000 0.000000 0.887790 0.796050 0.949070 0.000000 0.000000 0.000000 0.000000 0.150800 ... 0.062996 0.046529 0.029277 0.048688 0.030056 0.066896 0.064681 0.064771 0.033705 0
20 0.000000 0.911010 0.000000 0.000000 0.336420 0.000000 0.000000 0.180940 0.263610 0.988440 ... 0.008856 0.041469 0.009145 0.009094 0.009796 0.027103 0.042893 0.056196 0.012501 1
21 0.000000 0.000000 0.000000 0.198800 0.000000 0.901150 0.000000 0.000000 0.413540 0.000000 ... 0.020464 0.037698 0.028640 0.023290 0.030678 0.032787 0.049547 0.027087 0.044812 0
22 0.457150 1.565800 0.000000 0.000000 0.000000 0.719880 0.000000 0.000000 0.107970 0.000000 ... 0.036841 0.060575 0.041959 0.047929 0.039467 0.049774 0.082377 0.057957 0.035604 1
23 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.019534 0.006583 0.012513 0.028868 0.039953 0.014977 0.045486 0.021237 0.026507 0
24 1.145900 1.242200 0.000000 0.000000 0.000000 0.956940 0.000000 0.000000 0.000000 0.519750 ... 0.039259 0.018149 0.012153 0.018000 0.037783 0.019433 0.014982 0.025554 0.025585 1
25 0.384660 0.642160 0.000000 0.000000 0.000000 0.000000 0.141730 0.510140 0.000000 0.322540 ... 0.034577 0.031162 0.038961 0.041850 0.026021 0.007156 0.031507 0.048640 0.028067 1
26 0.044793 0.171220 0.150470 0.791640 0.100370 0.000000 0.000000 0.000000 0.000000 0.546100 ... 0.059782 0.027918 0.022951 0.067286 0.072825 0.026592 0.030550 0.047546 0.061572 0
27 0.024028 0.000000 0.000000 1.560600 0.323740 0.573730 0.000000 0.972080 0.402180 0.613800 ... 0.019090 0.011114 0.007980 0.045997 0.045494 0.018651 0.011630 0.011288 0.019492 0
28 0.311980 0.244520 0.212100 0.978550 0.000000 1.319800 0.000000 0.000000 0.000000 0.026332 ... 0.033687 0.066910 0.036916 0.029357 0.017351 0.020543 0.015300 0.016477 0.019715 0
29 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.558100 0.485880 0.815440 ... 0.039693 0.020571 0.044319 0.039485 0.047731 0.043560 0.043651 0.042374 0.048968 0
30 0.269790 0.024128 0.000156 0.000000 0.174980 0.000000 0.337810 0.000000 0.000000 0.000000 ... 0.011549 0.085321 0.016958 0.008131 0.022019 0.031845 0.020188 0.007039 0.012079 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3771 0.000000 0.000000 0.000000 0.000000 0.000000 0.613830 1.456800 1.350900 0.769510 0.000000 ... 0.042000 0.031909 0.012088 0.019042 0.013035 0.069337 0.040768 0.015824 0.033300 1
3772 0.484470 0.773390 0.000000 0.213940 0.000000 0.000000 0.000000 0.308140 0.120880 0.807669 ... 0.020615 0.010171 0.024787 0.021214 0.015437 0.021795 0.032656 0.011645 0.012731 1
3773 1.522700 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.018538 0.010537 0.027684 0.031280 0.027294 0.013832 0.020979 0.033723 0.039819 1
3774 0.945470 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.037988 0.022253 0.043092 0.031044 0.036253 0.059886 0.047193 0.065673 0.021502 1
3775 0.946234 0.067972 0.000000 0.435930 0.000000 0.000000 0.000000 0.000000 0.000000 0.426852 ... 0.032696 0.034843 0.034115 0.035841 0.018307 0.029870 0.038746 0.038803 0.026157 0
3776 0.523340 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.174440 0.000000 0.650650 ... 0.042804 0.010049 0.012368 0.032397 0.040944 0.016676 0.022170 0.023827 0.033898 1
3777 0.000000 0.000000 0.000000 0.000000 0.000000 0.276959 0.000000 0.000000 0.504155 1.662700 ... 0.015793 0.022092 0.014955 0.018942 0.016944 0.032190 0.028367 0.018088 0.015747 1
3778 0.304390 0.000000 0.258000 0.507590 0.000000 0.000000 0.009438 0.000000 0.000000 0.143250 ... 0.017947 0.024456 0.041372 0.041183 0.047993 0.033153 0.040062 0.023869 0.035208 0
3779 0.000000 0.000000 0.000000 0.000000 0.167130 0.000000 0.081230 0.000000 0.000000 0.000000 ... 0.033826 0.056108 0.062568 0.034892 0.028362 0.038791 0.040587 0.035817 0.017098 0
3780 0.000000 0.000000 0.063054 0.000000 0.000000 1.302700 0.000000 0.000000 0.221710 0.000000 ... 0.015461 0.007169 0.053071 0.034744 0.023068 0.002995 0.007839 0.036122 0.014431 1
3781 0.000000 0.072509 0.000000 0.283930 1.203500 0.017472 0.085080 0.000000 0.039984 0.364180 ... 0.018298 0.017468 0.020724 0.016279 0.013216 0.018009 0.011172 0.013381 0.009624 1
3782 0.082950 0.153360 0.000000 0.000000 0.000000 0.000000 0.973910 0.272972 0.043460 1.534700 ... 0.023122 0.011625 0.010129 0.008095 0.011113 0.041680 0.019421 0.020782 0.012575 1
3783 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.461830 1.176400 ... 0.013459 0.027828 0.038889 0.021408 0.003900 0.029600 0.029911 0.026131 0.011419 0
3784 0.000000 0.738330 0.000000 0.000000 0.000000 0.319599 0.000000 0.000000 0.000000 0.916750 ... 0.037651 0.049141 0.021031 0.028404 0.012977 0.039746 0.040718 0.013986 0.018132 1
3785 0.456985 0.190430 0.363980 0.000000 0.000000 0.000000 0.361088 0.383590 0.272360 0.000000 ... 0.036021 0.035155 0.031911 0.057594 0.080866 0.034773 0.047184 0.064976 0.019408 1
3786 0.092517 0.214898 0.000000 0.324010 0.000000 0.000000 0.000000 0.000000 0.191760 0.985159 ... 0.009291 0.005199 0.023028 0.006255 0.008951 0.023196 0.024608 0.010050 0.008300 1
3787 0.000000 0.000000 0.501170 0.057088 0.163170 0.646400 0.298280 0.757990 0.087317 0.146870 ... 0.021765 0.016235 0.044767 0.024993 0.010303 0.009179 0.031484 0.070180 0.046984 1
3788 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.825670 0.000000 0.143230 0.000000 ... 0.006477 0.011424 0.013247 0.004331 0.009335 0.008933 0.015937 0.017482 0.031523 0
3789 0.000000 0.000000 0.000000 0.102527 0.000000 1.338000 0.000000 0.000000 0.000000 0.113400 ... 0.006814 0.000684 0.001524 0.017507 0.001442 0.002318 0.016162 0.011766 0.007097 1
3790 0.000000 0.000000 0.000000 0.476500 0.265184 0.176770 0.085610 0.752900 0.000000 0.350030 ... 0.012905 0.008344 0.021348 0.031300 0.015413 0.014002 0.030157 0.037464 0.024539 0
3791 0.000000 0.317140 0.246310 0.000000 0.000000 0.000000 0.000000 0.457169 0.000000 0.345913 ... 0.020298 0.006256 0.015469 0.038902 0.012438 0.020679 0.027800 0.035840 0.015295 1
3792 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.152793 0.754110 0.151220 0.133960 ... 0.024765 0.026685 0.056608 0.034379 0.030035 0.050165 0.030962 0.038335 0.034809 1
3793 0.000000 0.000000 1.137600 0.000000 0.000000 0.000000 0.084377 0.000000 0.000000 0.947300 ... 0.022490 0.037247 0.041796 0.026112 0.010530 0.046502 0.041396 0.042460 0.033046 0
3794 0.000000 0.537230 0.000000 0.000000 0.000000 1.846200 0.000000 0.000000 0.000000 1.628400 ... 0.030879 0.037300 0.026726 0.017458 0.017493 0.027442 0.014037 0.011320 0.010543 0
3795 0.000000 0.000000 0.000000 0.262230 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.034215 0.043260 0.037229 0.042648 0.044968 0.038014 0.060082 0.054445 0.032456 0
3796 0.000000 0.944010 0.000000 0.000000 0.090126 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.032911 0.021328 0.016077 0.019606 0.035862 0.005605 0.003127 0.009222 0.019916 1
3797 0.000000 0.000000 0.000000 0.620882 0.000000 0.516080 0.000000 0.133720 0.359210 0.769915 ... 0.019692 0.013608 0.020126 0.021958 0.035866 0.025194 0.029437 0.029789 0.018152 1
3798 0.000000 0.000000 0.146570 0.000000 0.260730 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.022441 0.025916 0.040383 0.045961 0.012540 0.025097 0.016432 0.030621 0.016492 1
3799 0.000000 0.041131 0.293200 0.000000 0.024413 0.000000 0.262210 0.000000 0.000000 0.000000 ... 0.012463 0.024990 0.034452 0.014815 0.008251 0.058643 0.050752 0.038955 0.010777 1
3800 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.490140 0.330143 1.968300 0.008118 ... 0.055913 0.009899 0.019455 0.017555 0.019566 0.016966 0.048685 0.035669 0.028767 1

3800 rows × 4609 columns


In [12]:
#quantify class counts for full training data 
full_training_data.prediction.value_counts()


Out[12]:
1    2200
0    1600
Name: prediction, dtype: int64

2.2) Dealing with Confidence Labels

One approach to incorporating the confidence labels was to use the confidence label of each instance as its sample weight. Theoretically, a confidence label smaller than 1 scales down the effective $C$ for that instance, which results in a lower penalty for misclassifying an instance whose label is not known with certainty. However, in practice this did not behave as the theory suggests; introducing the sample weights reduced the overall accuracy of the model. The matter was further complicated by the fact that samples generated by over-sampling via SMOTE also had to be assigned a confidence label, which is difficult to determine objectively. Thus, it was decided that only data instances with a confidence label of 1 should be retained in the training data. This obviously leads to a large loss of information; however, after removing the instances that do not have a confidence label of 1, 1922 training instances remained, which can be assumed to be a reasonable training data size. After truncating the data set, the procedure described above was repeated for the truncated training data. In summary, the training data was truncated to include only instances with a confidence label of 1, the minority class of the training data was over-sampled using SMOTE to balance the class split, and class weights were then applied during the training of the SVM to ensure that the model was more sensitive to correctly classifying the majority class of the test data.
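
The three steps summarized above are sketched below. This is an outline only: it uses the full_train_wcl data frame built later in this section (cell In [15]), an assumed SMOTE call from the imbalanced-learn package, and placeholder class weights rather than the values actually used for the final model.

In [ ]:
#outline only: filter by confidence, over-sample with SMOTE, then weight classes
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

#1) keep only the instances whose label is known with full confidence
confident = full_train_wcl[full_train_wcl.confidence == 1.0]
X = confident.drop(columns=['prediction', 'confidence'])
y = confident['prediction']

#2) over-sample the minority class so the training classes are balanced
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

#3) apply class weights during training (placeholder values, not those actually used)
svm = SVC(kernel='rbf', class_weight={0: 1.0, 1: 1.0})
svm.fit(X_bal, y_bal)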

This section will cover how the confidence labels, one of the additional pieces of information provided in the assignment outline, were incorporated into the final training data set.


In [13]:
#Load confidence annotations  
confidence_labels = pd.read_csv("/Users/Max/Desktop/Max's Folder/Uni Work/Data Science MSc/Machine Learning/ML Kaggle Competition /Data Sets/Annotation Confidence .csv", header=0, index_col=0)

In [14]:
#quantify confidence labels (how many are 1, how many are 0.66)
print(confidence_labels.confidence.value_counts())

#observe confidence annotations 
confidence_labels


1.00    1922
0.66    1878
Name: confidence, dtype: int64
Out[14]:
confidence
ID
1 0.66
2 1.00
3 1.00
4 1.00
5 1.00
6 1.00
7 1.00
8 1.00
9 1.00
10 1.00
11 0.66
12 0.66
13 0.66
14 1.00
15 0.66
16 1.00
17 0.66
18 1.00
19 1.00
20 1.00
21 1.00
22 1.00
23 0.66
24 1.00
25 1.00
26 1.00
27 1.00
28 0.66
29 1.00
30 1.00
... ...
3771 0.66
3772 1.00
3773 1.00
3774 0.66
3775 0.66
3776 1.00
3777 0.66
3778 0.66
3779 1.00
3780 0.66
3781 0.66
3782 1.00
3783 1.00
3784 1.00
3785 1.00
3786 0.66
3787 1.00
3788 1.00
3789 1.00
3790 1.00
3791 1.00
3792 0.66
3793 0.66
3794 0.66
3795 0.66
3796 1.00
3797 0.66
3798 0.66
3799 0.66
3800 1.00

3800 rows × 1 columns


In [15]:
#adding confidence of label column to imputed full training data set 
full_train_wcl = pd.merge(full_training_data, confidence_labels, left_index=True, right_index=True)
full_train_wcl


Out[15]:
CNNs CNNs.1 CNNs.2 CNNs.3 CNNs.4 CNNs.5 CNNs.6 CNNs.7 CNNs.8 CNNs.9 ... GIST.504 GIST.505 GIST.506 GIST.507 GIST.508 GIST.509 GIST.510 GIST.511 prediction confidence
ID
1 0.000000 0.167840 1.477000 0.756510 0.387410 0.000000 0.212950 0.000000 0.000000 0.225750 ... 0.021306 0.027640 0.036184 0.047010 0.037981 0.049249 0.059802 0.035669 0 0.66
2 0.000000 0.000000 0.000000 0.442600 0.000000 0.000000 0.150240 1.480600 0.635870 0.020341 ... 0.020330 0.019916 0.033483 0.015937 0.021656 0.018347 0.017458 0.018744 0 1.00
3 0.000000 0.000000 0.000000 0.470420 0.000000 1.277900 0.459540 0.000000 0.000000 0.000000 ... 0.005156 0.041298 0.014921 0.015868 0.012122 0.015664 0.011410 0.017450 1 1.00
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.030878 0.928510 ... 0.007086 0.013696 0.028789 0.022858 0.030883 0.026539 0.021337 0.018109 1 1.00
5 0.490990 0.833880 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.188490 0.764420 ... 0.036306 0.029198 0.045733 0.008041 0.013111 0.022239 0.058815 0.014322 1 1.00
6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.008809 0.026506 0.018506 0.029058 0.009211 0.013236 0.031606 0.022141 1 1.00
7 0.368230 0.000000 0.000000 0.000000 0.000000 0.395810 0.948560 0.000000 0.000000 0.000000 ... 0.016638 0.040408 0.028362 0.016704 0.034409 0.025067 0.024614 0.026773 1 1.00
8 0.367450 0.000000 0.087409 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.588080 ... 0.017765 0.055323 0.067212 0.048452 0.019376 0.056357 0.056325 0.050188 1 1.00
9 0.066494 0.000000 0.000000 0.084850 0.608320 0.522730 0.000000 0.378330 0.000000 0.096584 ... 0.026661 0.021242 0.019962 0.040603 0.027398 0.019766 0.020432 0.032214 0 1.00
10 0.495670 2.536900 0.000000 0.000000 0.000000 1.530300 0.000000 0.000000 0.000000 0.000000 ... 0.013138 0.025195 0.017418 0.010645 0.012981 0.039255 0.016495 0.007007 1 1.00
11 0.000000 0.000000 0.000000 0.000000 0.623180 0.524910 0.000000 1.349400 0.000000 0.000000 ... 0.005029 0.031665 0.040577 0.026261 0.023069 0.043602 0.044524 0.066983 0 0.66
12 1.096500 0.720820 0.418210 0.000000 0.312950 0.000000 0.000000 0.000000 0.000000 0.464130 ... 0.034157 0.045233 0.044563 0.017900 0.043618 0.076412 0.036831 0.007185 1 0.66
13 0.653430 0.000000 0.142020 0.046679 0.000000 0.000000 0.650850 0.000000 0.000000 0.000000 ... 0.043096 0.055382 0.050194 0.039210 0.023657 0.021919 0.055182 0.027263 0 0.66
14 0.000000 0.000000 0.000000 0.000000 1.299300 0.882630 0.137290 0.000000 0.000000 0.790330 ... 0.018213 0.030337 0.011113 0.003206 0.036056 0.024078 0.020279 0.022261 0 1.00
15 0.000000 1.103800 0.000000 0.000000 1.145400 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.036399 0.031912 0.029357 0.056295 0.012967 0.021085 0.042250 0.037959 0 0.66
16 0.658060 1.518300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.115800 ... 0.024941 0.018340 0.025604 0.014863 0.038411 0.051662 0.075225 0.031801 1 1.00
17 0.915600 0.000000 0.000000 0.000000 0.000000 0.000000 0.827120 0.000000 0.000000 0.000000 ... 0.048584 0.034337 0.049318 0.021250 0.035184 0.032242 0.025961 0.023896 0 0.66
18 0.122340 0.484180 0.575630 0.000000 0.056843 0.000000 0.269810 0.803790 0.000000 0.847900 ... 0.025202 0.036941 0.101960 0.042061 0.033733 0.056107 0.043020 0.034273 1 1.00
19 0.000000 0.000000 0.887790 0.796050 0.949070 0.000000 0.000000 0.000000 0.000000 0.150800 ... 0.046529 0.029277 0.048688 0.030056 0.066896 0.064681 0.064771 0.033705 0 1.00
20 0.000000 0.911010 0.000000 0.000000 0.336420 0.000000 0.000000 0.180940 0.263610 0.988440 ... 0.041469 0.009145 0.009094 0.009796 0.027103 0.042893 0.056196 0.012501 1 1.00
21 0.000000 0.000000 0.000000 0.198800 0.000000 0.901150 0.000000 0.000000 0.413540 0.000000 ... 0.037698 0.028640 0.023290 0.030678 0.032787 0.049547 0.027087 0.044812 0 1.00
22 0.457150 1.565800 0.000000 0.000000 0.000000 0.719880 0.000000 0.000000 0.107970 0.000000 ... 0.060575 0.041959 0.047929 0.039467 0.049774 0.082377 0.057957 0.035604 1 1.00
23 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.006583 0.012513 0.028868 0.039953 0.014977 0.045486 0.021237 0.026507 0 0.66
24 1.145900 1.242200 0.000000 0.000000 0.000000 0.956940 0.000000 0.000000 0.000000 0.519750 ... 0.018149 0.012153 0.018000 0.037783 0.019433 0.014982 0.025554 0.025585 1 1.00
25 0.384660 0.642160 0.000000 0.000000 0.000000 0.000000 0.141730 0.510140 0.000000 0.322540 ... 0.031162 0.038961 0.041850 0.026021 0.007156 0.031507 0.048640 0.028067 1 1.00
26 0.044793 0.171220 0.150470 0.791640 0.100370 0.000000 0.000000 0.000000 0.000000 0.546100 ... 0.027918 0.022951 0.067286 0.072825 0.026592 0.030550 0.047546 0.061572 0 1.00
27 0.024028 0.000000 0.000000 1.560600 0.323740 0.573730 0.000000 0.972080 0.402180 0.613800 ... 0.011114 0.007980 0.045997 0.045494 0.018651 0.011630 0.011288 0.019492 0 1.00
28 0.311980 0.244520 0.212100 0.978550 0.000000 1.319800 0.000000 0.000000 0.000000 0.026332 ... 0.066910 0.036916 0.029357 0.017351 0.020543 0.015300 0.016477 0.019715 0 0.66
29 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.558100 0.485880 0.815440 ... 0.020571 0.044319 0.039485 0.047731 0.043560 0.043651 0.042374 0.048968 0 1.00
30 0.269790 0.024128 0.000156 0.000000 0.174980 0.000000 0.337810 0.000000 0.000000 0.000000 ... 0.085321 0.016958 0.008131 0.022019 0.031845 0.020188 0.007039 0.012079 1 1.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3771 0.000000 0.000000 0.000000 0.000000 0.000000 0.613830 1.456800 1.350900 0.769510 0.000000 ... 0.031909 0.012088 0.019042 0.013035 0.069337 0.040768 0.015824 0.033300 1 0.66
3772 0.484470 0.773390 0.000000 0.213940 0.000000 0.000000 0.000000 0.308140 0.120880 0.807669 ... 0.010171 0.024787 0.021214 0.015437 0.021795 0.032656 0.011645 0.012731 1 1.00
3773 1.522700 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.010537 0.027684 0.031280 0.027294 0.013832 0.020979 0.033723 0.039819 1 1.00
3774 0.945470 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.022253 0.043092 0.031044 0.036253 0.059886 0.047193 0.065673 0.021502 1 0.66
3775 0.946234 0.067972 0.000000 0.435930 0.000000 0.000000 0.000000 0.000000 0.000000 0.426852 ... 0.034843 0.034115 0.035841 0.018307 0.029870 0.038746 0.038803 0.026157 0 0.66
3776 0.523340 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.174440 0.000000 0.650650 ... 0.010049 0.012368 0.032397 0.040944 0.016676 0.022170 0.023827 0.033898 1 1.00
3777 0.000000 0.000000 0.000000 0.000000 0.000000 0.276959 0.000000 0.000000 0.504155 1.662700 ... 0.022092 0.014955 0.018942 0.016944 0.032190 0.028367 0.018088 0.015747 1 0.66
3778 0.304390 0.000000 0.258000 0.507590 0.000000 0.000000 0.009438 0.000000 0.000000 0.143250 ... 0.024456 0.041372 0.041183 0.047993 0.033153 0.040062 0.023869 0.035208 0 0.66
3779 0.000000 0.000000 0.000000 0.000000 0.167130 0.000000 0.081230 0.000000 0.000000 0.000000 ... 0.056108 0.062568 0.034892 0.028362 0.038791 0.040587 0.035817 0.017098 0 1.00
3780 0.000000 0.000000 0.063054 0.000000 0.000000 1.302700 0.000000 0.000000 0.221710 0.000000 ... 0.007169 0.053071 0.034744 0.023068 0.002995 0.007839 0.036122 0.014431 1 0.66
3781 0.000000 0.072509 0.000000 0.283930 1.203500 0.017472 0.085080 0.000000 0.039984 0.364180 ... 0.017468 0.020724 0.016279 0.013216 0.018009 0.011172 0.013381 0.009624 1 0.66
3782 0.082950 0.153360 0.000000 0.000000 0.000000 0.000000 0.973910 0.272972 0.043460 1.534700 ... 0.011625 0.010129 0.008095 0.011113 0.041680 0.019421 0.020782 0.012575 1 1.00
3783 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.461830 1.176400 ... 0.027828 0.038889 0.021408 0.003900 0.029600 0.029911 0.026131 0.011419 0 1.00
3784 0.000000 0.738330 0.000000 0.000000 0.000000 0.319599 0.000000 0.000000 0.000000 0.916750 ... 0.049141 0.021031 0.028404 0.012977 0.039746 0.040718 0.013986 0.018132 1 1.00
3785 0.456985 0.190430 0.363980 0.000000 0.000000 0.000000 0.361088 0.383590 0.272360 0.000000 ... 0.035155 0.031911 0.057594 0.080866 0.034773 0.047184 0.064976 0.019408 1 1.00
3786 0.092517 0.214898 0.000000 0.324010 0.000000 0.000000 0.000000 0.000000 0.191760 0.985159 ... 0.005199 0.023028 0.006255 0.008951 0.023196 0.024608 0.010050 0.008300 1 0.66
3787 0.000000 0.000000 0.501170 0.057088 0.163170 0.646400 0.298280 0.757990 0.087317 0.146870 ... 0.016235 0.044767 0.024993 0.010303 0.009179 0.031484 0.070180 0.046984 1 1.00
3788 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.825670 0.000000 0.143230 0.000000 ... 0.011424 0.013247 0.004331 0.009335 0.008933 0.015937 0.017482 0.031523 0 1.00
3789 0.000000 0.000000 0.000000 0.102527 0.000000 1.338000 0.000000 0.000000 0.000000 0.113400 ... 0.000684 0.001524 0.017507 0.001442 0.002318 0.016162 0.011766 0.007097 1 1.00
3790 0.000000 0.000000 0.000000 0.476500 0.265184 0.176770 0.085610 0.752900 0.000000 0.350030 ... 0.008344 0.021348 0.031300 0.015413 0.014002 0.030157 0.037464 0.024539 0 1.00
3791 0.000000 0.317140 0.246310 0.000000 0.000000 0.000000 0.000000 0.457169 0.000000 0.345913 ... 0.006256 0.015469 0.038902 0.012438 0.020679 0.027800 0.035840 0.015295 1 1.00
3792 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.152793 0.754110 0.151220 0.133960 ... 0.026685 0.056608 0.034379 0.030035 0.050165 0.030962 0.038335 0.034809 1 0.66
3793 0.000000 0.000000 1.137600 0.000000 0.000000 0.000000 0.084377 0.000000 0.000000 0.947300 ... 0.037247 0.041796 0.026112 0.010530 0.046502 0.041396 0.042460 0.033046 0 0.66
3794 0.000000 0.537230 0.000000 0.000000 0.000000 1.846200 0.000000 0.000000 0.000000 1.628400 ... 0.037300 0.026726 0.017458 0.017493 0.027442 0.014037 0.011320 0.010543 0 0.66
3795 0.000000 0.000000 0.000000 0.262230 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.043260 0.037229 0.042648 0.044968 0.038014 0.060082 0.054445 0.032456 0 0.66
3796 0.000000 0.944010 0.000000 0.000000 0.090126 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.021328 0.016077 0.019606 0.035862 0.005605 0.003127 0.009222 0.019916 1 1.00
3797 0.000000 0.000000 0.000000 0.620882 0.000000 0.516080 0.000000 0.133720 0.359210 0.769915 ... 0.013608 0.020126 0.021958 0.035866 0.025194 0.029437 0.029789 0.018152 1 0.66
3798 0.000000 0.000000 0.146570 0.000000 0.260730 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.025916 0.040383 0.045961 0.012540 0.025097 0.016432 0.030621 0.016492 1 0.66
3799 0.000000 0.041131 0.293200 0.000000 0.024413 0.000000 0.262210 0.000000 0.000000 0.000000 ... 0.024990 0.034452 0.014815 0.008251 0.058643 0.050752 0.038955 0.010777 1 0.66
3800 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.490140 0.330143 1.968300 0.008118 ... 0.009899 0.019455 0.017555 0.019566 0.016966 0.048685 0.035669 0.028767 1 1.00

3800 rows × 4610 columns

The original Notebook tried two methods of incorporating the confidence labels into the model:

  1. Use all data instances, irrespective of confidence labels, but set each instance's confidence label as its sample weight during the training phase (a sketch of this approach is given below).
  2. Only use instances with a confidence label of 1.

The best model was based on Method 2; thus, only Method 2 is shown in this section.
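
For reference, the snippet below is a minimal sketch (not one of the original cells) of how Method 1 could be implemented with scikit-learn, passing the confidence column of full_train_wcl as per-instance sample weights; the kernel choice and variable names are assumptions for illustration only. Method 2, the approach actually used, is implemented in the cells that follow.

#hypothetical sketch of Method 1 (not used for the final model): weight each
#training instance by its confidence label when fitting the SVM
from sklearn.svm import SVC

X_all = full_train_wcl.drop(['prediction', 'confidence'], axis=1).values
y_all = full_train_wcl['prediction'].values
weights = full_train_wcl['confidence'].values

svm_method_1 = SVC(kernel='rbf')
svm_method_1.fit(X_all, y_all, sample_weight=weights)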


In [16]:
#only keep data instance with confidence label = 1
conf_full_train = full_train_wcl.loc[full_train_wcl['confidence'] == 1]
conf_full_train


Out[16]:
CNNs CNNs.1 CNNs.2 CNNs.3 CNNs.4 CNNs.5 CNNs.6 CNNs.7 CNNs.8 CNNs.9 ... GIST.504 GIST.505 GIST.506 GIST.507 GIST.508 GIST.509 GIST.510 GIST.511 prediction confidence
ID
2 0.000000 0.000000 0.000000 0.442600 0.000000 0.000000 0.150240 1.480600 0.635870 0.020341 ... 0.020330 0.019916 0.033483 0.015937 0.021656 0.018347 0.017458 0.018744 0 1.0
3 0.000000 0.000000 0.000000 0.470420 0.000000 1.277900 0.459540 0.000000 0.000000 0.000000 ... 0.005156 0.041298 0.014921 0.015868 0.012122 0.015664 0.011410 0.017450 1 1.0
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.030878 0.928510 ... 0.007086 0.013696 0.028789 0.022858 0.030883 0.026539 0.021337 0.018109 1 1.0
5 0.490990 0.833880 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.188490 0.764420 ... 0.036306 0.029198 0.045733 0.008041 0.013111 0.022239 0.058815 0.014322 1 1.0
6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.008809 0.026506 0.018506 0.029058 0.009211 0.013236 0.031606 0.022141 1 1.0
7 0.368230 0.000000 0.000000 0.000000 0.000000 0.395810 0.948560 0.000000 0.000000 0.000000 ... 0.016638 0.040408 0.028362 0.016704 0.034409 0.025067 0.024614 0.026773 1 1.0
8 0.367450 0.000000 0.087409 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.588080 ... 0.017765 0.055323 0.067212 0.048452 0.019376 0.056357 0.056325 0.050188 1 1.0
9 0.066494 0.000000 0.000000 0.084850 0.608320 0.522730 0.000000 0.378330 0.000000 0.096584 ... 0.026661 0.021242 0.019962 0.040603 0.027398 0.019766 0.020432 0.032214 0 1.0
10 0.495670 2.536900 0.000000 0.000000 0.000000 1.530300 0.000000 0.000000 0.000000 0.000000 ... 0.013138 0.025195 0.017418 0.010645 0.012981 0.039255 0.016495 0.007007 1 1.0
14 0.000000 0.000000 0.000000 0.000000 1.299300 0.882630 0.137290 0.000000 0.000000 0.790330 ... 0.018213 0.030337 0.011113 0.003206 0.036056 0.024078 0.020279 0.022261 0 1.0
16 0.658060 1.518300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.115800 ... 0.024941 0.018340 0.025604 0.014863 0.038411 0.051662 0.075225 0.031801 1 1.0
18 0.122340 0.484180 0.575630 0.000000 0.056843 0.000000 0.269810 0.803790 0.000000 0.847900 ... 0.025202 0.036941 0.101960 0.042061 0.033733 0.056107 0.043020 0.034273 1 1.0
19 0.000000 0.000000 0.887790 0.796050 0.949070 0.000000 0.000000 0.000000 0.000000 0.150800 ... 0.046529 0.029277 0.048688 0.030056 0.066896 0.064681 0.064771 0.033705 0 1.0
20 0.000000 0.911010 0.000000 0.000000 0.336420 0.000000 0.000000 0.180940 0.263610 0.988440 ... 0.041469 0.009145 0.009094 0.009796 0.027103 0.042893 0.056196 0.012501 1 1.0
21 0.000000 0.000000 0.000000 0.198800 0.000000 0.901150 0.000000 0.000000 0.413540 0.000000 ... 0.037698 0.028640 0.023290 0.030678 0.032787 0.049547 0.027087 0.044812 0 1.0
22 0.457150 1.565800 0.000000 0.000000 0.000000 0.719880 0.000000 0.000000 0.107970 0.000000 ... 0.060575 0.041959 0.047929 0.039467 0.049774 0.082377 0.057957 0.035604 1 1.0
24 1.145900 1.242200 0.000000 0.000000 0.000000 0.956940 0.000000 0.000000 0.000000 0.519750 ... 0.018149 0.012153 0.018000 0.037783 0.019433 0.014982 0.025554 0.025585 1 1.0
25 0.384660 0.642160 0.000000 0.000000 0.000000 0.000000 0.141730 0.510140 0.000000 0.322540 ... 0.031162 0.038961 0.041850 0.026021 0.007156 0.031507 0.048640 0.028067 1 1.0
26 0.044793 0.171220 0.150470 0.791640 0.100370 0.000000 0.000000 0.000000 0.000000 0.546100 ... 0.027918 0.022951 0.067286 0.072825 0.026592 0.030550 0.047546 0.061572 0 1.0
27 0.024028 0.000000 0.000000 1.560600 0.323740 0.573730 0.000000 0.972080 0.402180 0.613800 ... 0.011114 0.007980 0.045997 0.045494 0.018651 0.011630 0.011288 0.019492 0 1.0
29 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.558100 0.485880 0.815440 ... 0.020571 0.044319 0.039485 0.047731 0.043560 0.043651 0.042374 0.048968 0 1.0
30 0.269790 0.024128 0.000156 0.000000 0.174980 0.000000 0.337810 0.000000 0.000000 0.000000 ... 0.085321 0.016958 0.008131 0.022019 0.031845 0.020188 0.007039 0.012079 1 1.0
31 0.000000 0.588330 0.000000 0.438650 0.000000 0.000000 0.000000 0.000000 1.365300 0.000000 ... 0.001618 0.011399 0.013321 0.011482 0.004453 0.015329 0.011126 0.013029 0 1.0
32 0.000000 0.000000 0.000000 0.802240 0.000000 0.808990 0.000000 0.000000 0.146920 0.000000 ... 0.005147 0.020457 0.016770 0.026065 0.008703 0.009715 0.017511 0.020147 0 1.0
34 0.000000 0.144530 0.000000 1.216500 0.000000 1.906900 0.000000 0.770710 0.000000 0.807150 ... 0.014471 0.018758 0.017670 0.024831 0.006508 0.015084 0.019683 0.018637 0 1.0
36 0.530580 0.180050 0.004283 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.411300 ... 0.021727 0.015733 0.024622 0.023215 0.021216 0.036337 0.043055 0.023884 1 1.0
41 0.000000 1.095000 0.000000 0.000000 0.855240 0.000000 0.270590 1.435700 0.000000 0.000000 ... 0.013380 0.047201 0.036005 0.007252 0.025466 0.082856 0.034223 0.023905 0 1.0
42 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.445730 0.000000 0.987190 ... 0.004484 0.005163 0.047327 0.022377 0.013530 0.019314 0.020233 0.016985 1 1.0
44 0.563240 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.157900 ... 0.023095 0.027842 0.048980 0.006251 0.034309 0.035927 0.089608 0.013321 1 1.0
46 0.000000 0.000000 0.000000 0.000000 0.000000 0.791610 0.038102 0.000000 0.345330 0.000000 ... 0.003776 0.016488 0.038568 0.027254 0.030923 0.038226 0.043074 0.038702 1 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3736 0.000000 0.000000 0.379880 0.000000 0.430587 0.000000 1.310400 0.718670 0.000000 1.206700 ... 0.022346 0.034921 0.021108 0.028067 0.028694 0.035580 0.017478 0.012353 0 1.0
3737 0.000000 0.000000 0.000000 0.201489 0.000000 0.000000 0.142460 1.482600 0.000000 0.000000 ... 0.005164 0.016672 0.019354 0.020468 0.026234 0.026968 0.021722 0.031073 0 1.0
3738 0.000000 0.000000 0.000000 0.000000 0.000000 0.560290 0.000000 0.226157 0.000000 1.286300 ... 0.035721 0.033835 0.034028 0.018540 0.024019 0.023726 0.019669 0.018807 1 1.0
3744 0.000000 0.000000 0.000000 1.864400 0.000000 0.000000 0.000000 0.558016 0.177810 0.686460 ... 0.022472 0.106690 0.088522 0.046389 0.020237 0.031843 0.037608 0.043574 0 1.0
3746 2.080200 4.446900 0.000000 0.000000 0.186504 0.000000 0.000000 0.000000 0.000000 1.808500 ... 0.062080 0.045719 0.071660 0.046617 0.060897 0.059359 0.034978 0.064414 1 1.0
3749 0.000000 0.123121 0.000000 0.000000 0.000000 2.269100 1.344900 0.642410 0.000000 0.000000 ... 0.018565 0.042016 0.038489 0.022180 0.032714 0.031088 0.020690 0.026536 0 1.0
3750 0.000000 0.326500 0.000000 0.000000 0.000000 0.000000 0.168060 0.627750 1.645500 0.812560 ... 0.040700 0.042014 0.046628 0.033452 0.029408 0.027535 0.026706 0.027509 1 1.0
3752 0.000000 0.000000 0.004252 0.316155 0.032557 0.519740 0.470590 0.677960 0.000000 0.226440 ... 0.019529 0.021082 0.028168 0.031610 0.033692 0.048022 0.052835 0.025539 0 1.0
3753 0.000000 0.000000 0.201820 0.620210 0.200890 0.603980 0.000000 0.800740 0.792020 0.023076 ... 0.001814 0.012130 0.025099 0.006197 0.004581 0.013265 0.016408 0.019732 0 1.0
3754 0.000000 0.442410 0.019417 0.000000 0.000000 0.121730 0.508880 0.270060 0.172236 0.000000 ... 0.018659 0.035507 0.049965 0.024217 0.018574 0.018603 0.046691 0.030947 1 1.0
3756 0.706050 0.000000 0.000000 0.000000 0.000000 0.000000 0.101560 0.925570 0.008006 0.000000 ... 0.030222 0.018685 0.030030 0.015352 0.028444 0.045679 0.059895 0.047354 1 1.0
3760 0.000000 0.000000 0.000000 0.764230 0.000000 0.279088 0.274390 0.016104 0.347830 0.958480 ... 0.010139 0.021263 0.022355 0.029823 0.009918 0.037257 0.036307 0.021285 1 1.0
3762 0.267280 0.000000 0.511380 0.000000 0.002737 0.706080 0.033460 0.000000 0.121620 0.106471 ... 0.041355 0.042160 0.045000 0.019701 0.036989 0.037595 0.043883 0.036089 0 1.0
3763 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.106660 0.644870 1.045280 ... 0.026702 0.025092 0.003558 0.014179 0.042395 0.021666 0.006262 0.039873 1 1.0
3765 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.191198 0.000000 0.000000 0.184490 ... 0.022409 0.042996 0.012631 0.009260 0.027802 0.033995 0.013917 0.015682 0 1.0
3772 0.484470 0.773390 0.000000 0.213940 0.000000 0.000000 0.000000 0.308140 0.120880 0.807669 ... 0.010171 0.024787 0.021214 0.015437 0.021795 0.032656 0.011645 0.012731 1 1.0
3773 1.522700 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.010537 0.027684 0.031280 0.027294 0.013832 0.020979 0.033723 0.039819 1 1.0
3776 0.523340 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.174440 0.000000 0.650650 ... 0.010049 0.012368 0.032397 0.040944 0.016676 0.022170 0.023827 0.033898 1 1.0
3779 0.000000 0.000000 0.000000 0.000000 0.167130 0.000000 0.081230 0.000000 0.000000 0.000000 ... 0.056108 0.062568 0.034892 0.028362 0.038791 0.040587 0.035817 0.017098 0 1.0
3782 0.082950 0.153360 0.000000 0.000000 0.000000 0.000000 0.973910 0.272972 0.043460 1.534700 ... 0.011625 0.010129 0.008095 0.011113 0.041680 0.019421 0.020782 0.012575 1 1.0
3783 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.461830 1.176400 ... 0.027828 0.038889 0.021408 0.003900 0.029600 0.029911 0.026131 0.011419 0 1.0
3784 0.000000 0.738330 0.000000 0.000000 0.000000 0.319599 0.000000 0.000000 0.000000 0.916750 ... 0.049141 0.021031 0.028404 0.012977 0.039746 0.040718 0.013986 0.018132 1 1.0
3785 0.456985 0.190430 0.363980 0.000000 0.000000 0.000000 0.361088 0.383590 0.272360 0.000000 ... 0.035155 0.031911 0.057594 0.080866 0.034773 0.047184 0.064976 0.019408 1 1.0
3787 0.000000 0.000000 0.501170 0.057088 0.163170 0.646400 0.298280 0.757990 0.087317 0.146870 ... 0.016235 0.044767 0.024993 0.010303 0.009179 0.031484 0.070180 0.046984 1 1.0
3788 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.825670 0.000000 0.143230 0.000000 ... 0.011424 0.013247 0.004331 0.009335 0.008933 0.015937 0.017482 0.031523 0 1.0
3789 0.000000 0.000000 0.000000 0.102527 0.000000 1.338000 0.000000 0.000000 0.000000 0.113400 ... 0.000684 0.001524 0.017507 0.001442 0.002318 0.016162 0.011766 0.007097 1 1.0
3790 0.000000 0.000000 0.000000 0.476500 0.265184 0.176770 0.085610 0.752900 0.000000 0.350030 ... 0.008344 0.021348 0.031300 0.015413 0.014002 0.030157 0.037464 0.024539 0 1.0
3791 0.000000 0.317140 0.246310 0.000000 0.000000 0.000000 0.000000 0.457169 0.000000 0.345913 ... 0.006256 0.015469 0.038902 0.012438 0.020679 0.027800 0.035840 0.015295 1 1.0
3796 0.000000 0.944010 0.000000 0.000000 0.090126 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.021328 0.016077 0.019606 0.035862 0.005605 0.003127 0.009222 0.019916 1 1.0
3800 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.490140 0.330143 1.968300 0.008118 ... 0.009899 0.019455 0.017555 0.019566 0.016966 0.048685 0.035669 0.028767 1 1.0

1922 rows × 4610 columns


In [17]:
#quantify class counts 
conf_full_train.prediction.value_counts()


Out[17]:
1    1106
0     816
Name: prediction, dtype: int64

In [18]:
#convert full training data dataframe with confidence instances only to matrix
conf_ft_matrix = conf_full_train.as_matrix(columns=None)
conf_ft_matrix
conf_ft_matrix.shape


Out[18]:
(1922, 4610)

In [19]:
#splitting full training data with confidence into inputs and outputs 
conf_ft_inputs = conf_ft_matrix[:,0:4608]
print(conf_ft_inputs.shape)
conf_ft_outputs = conf_ft_matrix[:,4608]
print(conf_ft_outputs.shape)


(1922, 4608)
(1922,)

2.3) Dealing with Class Imbalance

Binary classification tasks often suffer from imbalanced class splits. Training a model on a data set containing more instances of one class than the other can bias the model towards the majority class, as sensitivity is lost in detecting the minority class [17]. This is pertinent here because the training data (additional and original data included) has an unbalanced class split, with more instances of Class 1 than Class 0. Training the model on this data as-is would therefore produce a model biased towards Class 1 detections. To exacerbate the issue, the test data is also unbalanced, but its majority class is Class 0. There are two primary strategies in the literature for dealing with class imbalance: balancing (or further unbalancing) the data set as required, or introducing class weights, where the underlying algorithm applies different misclassification penalties to different classes [15].

Both approaches were combined here: the training data was first balanced, and the model was then trained to favour Class 0, as Class 0 is the majority class in the test data. The 'imbalanced-learn' API [16] implements class-balancing strategies from the literature, such as SMOTE [17]. SMOTE over-samples the minority class until the data set is balanced, generating synthetic minority instances by interpolating between a minority-class instance and one of its k nearest minority-class neighbours; unlike kNN for imputation, the suggested value of k here was 5. Once the data set was balanced through SMOTE, class weights were introduced. Because the test data contains more Class 0 instances, the class weights were adjusted so that misclassification of Class 0 is penalized more heavily than misclassification of Class 1; the ratio of the training class weights was set to match the class proportions of the test data, i.e. Class 0 weight = 1.33 and Class 1 weight = 1. Over-sampling of the minority class was preferred over under-sampling of the majority class because the data quantity was already scarce (evident from the sections above). Furthermore, over-sampling to a full class balance permits the use of plain accuracy as the evaluation metric, as opposed to the more involved AUC. In summary, as well as balancing the training-data class split, the model itself was adjusted to place more emphasis on correct Class 0 classifications.
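
As a concrete illustration of how the class-weight ratio follows from the test-class proportions, the short sketch below uses assumed proportions (roughly 57% Class 0 and 43% Class 1) purely for demonstration; the actual proportions were supplied with the competition data, and only the resulting ratio of approximately 1.33 : 1 is used later.

#hypothetical illustration: derive class weights from assumed test-class proportions
p_class_0 = 0.57   #assumed proportion of Class 0 in the test data (illustrative)
p_class_1 = 0.43   #assumed proportion of Class 1 in the test data (illustrative)

#penalise misclassifying the test-majority class (Class 0) more heavily,
#keeping the weight ratio equal to the class-proportion ratio
class_weights = {0: round(p_class_0 / p_class_1, 2), 1: 1}
print(class_weights)   #{0: 1.33, 1: 1}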

This section covers how the class imbalance of the training data was addressed. The best approach found was over-sampling using SMOTE, which over-samples the minority class until the data set is completely balanced. Note: the imblearn package may need to be installed first.


In [20]:
from imblearn.over_sampling import SMOTE 
from collections import Counter

In [21]:
#fit over-sampling to training data inputs and outputs
over_sampler = SMOTE(ratio='auto', k_neighbors=5, kind='regular', random_state=0)
over_sampler.fit(conf_ft_inputs, conf_ft_outputs)


Out[21]:
SMOTE(k=None, k_neighbors=5, kind='regular', m=None, m_neighbors=10, n_jobs=1,
   out_step=0.5, random_state=0, ratio='auto', svm_estimator=None)

In [22]:
#create new, class-balanced inputs and outputs via SMOTE over-sampling
resampled_x, resampled_y = over_sampler.fit_sample(conf_ft_inputs, conf_ft_outputs)

In [23]:
#quantify original class proportions prior to over-sampling
Counter(conf_ft_outputs)


Out[23]:
Counter({0.0: 816, 1.0: 1106})

In [24]:
#quantify class proportions after over-sampling
Counter(resampled_y)


Out[24]:
Counter({0.0: 1106, 1.0: 1106})

In [25]:
#assign newly sampled input and outputs to old variable name used for inputs and outputs before
#over-sampling 
conf_ft_inputs = resampled_x
conf_ft_outputs = resampled_y
print(Counter(conf_ft_outputs))


Counter({0.0: 1106, 1.0: 1106})

3. Pre-Processing

The pre-processing of the data consisted of two steps. First, the features were rescaled onto a common scale. Second, feature extraction was performed to reduce the unwieldy dimensionality of the training data, concomitantly increasing the signal-to-noise ratio and decreasing time complexity.

This section covers the pre-processing that produced the best-performing model. Several feature scaling methods were tried; the best was standardisation. Feature extraction was achieved via PCA.

3.1) Feature Scaling

Feature scaling is important because it ensures that all features are expressed on a common scale, irrespective of the units used to describe the original features. Feature scaling can take the form of standardisation, normalisation or rescaling. There is no universally correct choice of scaling method; it is highly context-dependent. Thus, all three approaches were tried, and the best results were obtained with standardisation.
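
The alternative scaling variants are not reproduced in this Notebook; the snippet below is a minimal sketch, assuming scikit-learn's preprocessing module, of how the three approaches could be produced for comparison. Only the standardisation variant (fitted in the next cell) was kept for the final model.

#sketch of the three scaling approaches that were compared (illustrative only)
from sklearn import preprocessing

standardised = preprocessing.StandardScaler().fit_transform(conf_ft_inputs)  #zero mean, unit variance per feature
normalised = preprocessing.Normalizer().fit_transform(conf_ft_inputs)        #unit L2 norm per instance
rescaled = preprocessing.MinMaxScaler().fit_transform(conf_ft_inputs)        #each feature rescaled to [0, 1]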


In [26]:
#standardise the over-sampled training inputs (confidence label 1 only)
scaler_2 = preprocessing.StandardScaler().fit(conf_ft_inputs)
std_conf_ft_in = scaler_2.transform(conf_ft_inputs)
std_conf_ft_in


Out[26]:
array([[-0.45543783, -0.4535694 , -0.30705458, ..., -0.70537079,
        -0.89527325, -0.42860037],
       [-0.45543783, -0.4535694 , -0.30705458, ..., -0.88589433,
        -1.30662422, -0.52141952],
       [-0.45543783, -0.4535694 , -0.30705458, ..., -0.15417854,
        -0.63144547, -0.47414918],
       ..., 
       [-0.45543783, -0.4535694 , -0.30705458, ..., -1.34145834,
        -0.93541528, -0.13391536],
       [ 0.5815841 , -0.4535694 ,  0.80741398, ...,  0.92682162,
         2.74538189,  1.97810955],
       [-0.45543783, -0.37984075, -0.30705458, ..., -0.22601672,
        -0.67602594,  0.20590276]])

3.2) Principal Component Analysis (PCA)

High dimensionality should be reduced because high-dimensional feature sets are likely to contain noisy features and because high dimensionality increases computational time complexity [18]. Dimensionality reduction can be achieved via feature selection methods, such as filters and wrappers [19], or via feature extraction methods, such as PCA [20]. Here, dimensionality reduction was conducted via feature extraction, specifically through PCA. The rationale is that the relative importance of the GIST and CNN features is undetermined, and feature selection methods may require domain expertise to be effective. PCA uses the eigenvectors and eigenvalues of the data's covariance matrix to construct principal components: uncorrelated directions, ordered by the proportion of the dataset's variance they explain. There is no a priori optimal number of principal components to retain, so it was determined experimentally. This was done by plotting the change in variance explained as a function of the number of principal components included, and by calculating the cross-validation test score for data transformed using different numbers of principal components.
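
As a small aside (not part of the original workflow), the sketch below illustrates on synthetic data the eigendecomposition view of PCA described above: the variance ratios reported by scikit-learn's PCA match the normalised leading eigenvalues of the covariance matrix.

#illustrative sketch on synthetic data: PCA vs covariance eigendecomposition
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = np.dot(rng.randn(200, 5), rng.randn(5, 5))   #correlated synthetic features

#eigenvalues of the covariance matrix, largest first
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

pca_demo = PCA(n_components=2).fit(X)

#the two printed arrays should agree (up to numerical precision)
print(pca_demo.explained_variance_ratio_)
print(eigvals[:2] / eigvals.sum())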


In [27]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#preprocessing: PCA (feature construction). High number of pcs chosen to plot a graph
#showing how much more variance is explained as pc number increases 
pca_2 = PCA(n_components=700, random_state=0)
std_conf_ft_in_pca = pca_2.fit_transform(std_conf_ft_in)
#quantify amount of variance explained by principal components
print("Total Variance Explained by PCs (%): ", np.sum(pca_2.explained_variance_ratio_))


Total Variance Explained by PCs (%):  0.917280339666

The cell below plots the cumulative proportion of variance explained as the number of principal components included increases.


In [28]:
#calculate a list of cumulative sums for amount of variance explained
cumulative_variance = np.cumsum(pca_2.explained_variance_ratio_)
len(cumulative_variance)
#add 0 to the beginning of the list, otherwise list starts with variance explained by 1 pc
cumulative_variance = np.insert(cumulative_variance, 0, 0) 

#define range of pcs
pcs_4_var_exp = np.arange(0,701,1)
len(pcs_4_var_exp)

fig_1 = plt.figure(figsize=(7,4))
plt.title('Number of PCs and Change In Variance Explained')
plt.xlabel('Number of PCs')
plt.ylabel('Variance Explained (%)')
plt.plot(pcs_4_var_exp, cumulative_variance, 'x-', color="r")
plt.show()


The graph above suggests that the number of principal components should not exceed roughly 300, as the additional variance explained per component diminishes beyond that point. For the optimisation, the optimal number of principal components was initially assumed to be 230.


In [29]:
#preprocessing: PCA (feature construction)
pca_2 = PCA(n_components=230, random_state=0)
std_conf_ft_in_pca = pca_2.fit_transform(std_conf_ft_in)
#quantify ratio of variance explain by principal components
print("Total Variance Explained by PCs (%): ", np.sum(pca_2.explained_variance_ratio_))


Total Variance Explained by PCs (%):  0.78651096325

4. Model Selection

The optimisation was conducted using a grid search, over two kernels: the polynomial kernel and the RBF kernel. The initial search for optimal parameters was conducted on a logarithmic scale to explore as much of the parameter space as possible. From the results, the parameter ranges were refined and pruned to only the best-performing candidates. The choice of parameters was based purely on accuracy, not on practical factors such as memory consumption or the time complexity of predictions. The best model was determined on the following merits:

  1. Good generalisation - achieving a high test score during cross-validation.
  2. Avoidance of over-fitting - restricting the magnitude of the training scores during cross-validation. In particular, a training score beyond an arbitrary limit was taken as indicative of over-fitting. Thus, a balance had to be struck between the two criteria to ensure good generalisation.

This section covers how the best model was selected. Two kernels were tried: RBF and polynomial. RBF outperformed polynomial, therefore only the optimisation results for the RBF kernel are presented here. Furthermore, the parameter ranges had already been pruned at this point, so only the final ranges are used in the Grid Search below.
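
The initial coarse search on a logarithmic scale mentioned above is not reproduced in this Notebook; the cell below is a minimal sketch of what it might have looked like, with the logspace ranges chosen here purely for illustration.

#hypothetical sketch of the initial coarse, logarithmic-scale parameter search
coarse_c_range = np.logspace(-3, 3, 7)        #0.001 ... 1000
coarse_gamma_range = np.logspace(-6, -1, 6)   #1e-06 ... 0.1

coarse_grid = [{'C': coarse_c_range, 'gamma': coarse_gamma_range,
                'kernel': ['rbf'], 'class_weight': [{0: 1.33, 1: 1}]}]

coarse_search = GridSearchCV(SVC(), coarse_grid, scoring='accuracy', cv=5)
coarse_search.fit(std_conf_ft_in_pca, conf_ft_outputs)
print(coarse_search.best_params_, coarse_search.best_score_)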

4.1) Parameter Optimisation


In [30]:
#this cell takes around 7 minutes to run
#parameter optimisation with Exhaustive Grid Search, with class weight 
original_c_range = np.arange(0.85, 1.01, 0.01)
gamma_range = np.arange(0.00001, 0.00023, 0.00002)

#define parameter ranges to test
param_grid = [{'C': original_c_range, 'gamma': gamma_range, 'kernel': ['rbf'],
             'class_weight':[{0:1.33, 1:1}]}]

#define model to do parameter search on
svr = SVC()
clf = GridSearchCV(svr, param_grid, scoring='accuracy', cv=5,)
clf.fit(std_conf_ft_in_pca, conf_ft_outputs)

#create dictionary of results
results_dict = clf.cv_results_

#convert the results into a dataframe
df_results = pd.DataFrame.from_dict(results_dict)
df_results


Out[30]:
mean_fit_time mean_score_time mean_test_score mean_train_score param_C param_class_weight param_gamma param_kernel params rank_test_score ... split2_test_score split2_train_score split3_test_score split3_train_score split4_test_score split4_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.603728 0.144425 0.876582 0.886303 0.85 {0: 1.33, 1: 1} 1e-05 rbf {'kernel': 'rbf', 'gamma': 1e-05, 'class_weigh... 183 ... 0.880090 0.881356 0.873303 0.885311 0.857466 0.890960 0.050004 0.038218 0.011687 0.003265
1 0.404284 0.083440 0.884268 0.900656 0.85 {0: 1.33, 1: 1} 3e-05 rbf {'kernel': 'rbf', 'gamma': 3e-05, 'class_weigh... 163 ... 0.904977 0.897740 0.873303 0.900565 0.871041 0.904520 0.020091 0.000607 0.013607 0.003712
2 0.362656 0.078480 0.886528 0.907550 0.85 {0: 1.33, 1: 1} 5e-05 rbf {'kernel': 'rbf', 'gamma': 5e-05, 'class_weigh... 151 ... 0.907240 0.907345 0.882353 0.909040 0.871041 0.908475 0.004201 0.001375 0.012674 0.001895
3 0.361930 0.076454 0.889241 0.915574 0.85 {0: 1.33, 1: 1} 7e-05 rbf {'kernel': 'rbf', 'gamma': 7e-05, 'class_weigh... 124 ... 0.907240 0.917514 0.884615 0.917514 0.873303 0.916949 0.012624 0.003171 0.012640 0.002570
4 0.353906 0.075184 0.891953 0.920208 0.85 {0: 1.33, 1: 1} 9e-05 rbf {'kernel': 'rbf', 'gamma': 9e-05, 'class_weigh... 55 ... 0.902715 0.924294 0.884615 0.922034 0.877828 0.918644 0.007083 0.001746 0.011165 0.002548
5 0.350159 0.074468 0.892857 0.924841 0.85 {0: 1.33, 1: 1} 0.00011 rbf {'kernel': 'rbf', 'gamma': 0.00011, 'class_wei... 37 ... 0.902715 0.926554 0.880090 0.927119 0.877828 0.922599 0.004887 0.001453 0.013739 0.001945
6 0.349687 0.075331 0.890145 0.930831 0.85 {0: 1.33, 1: 1} 0.00013 rbf {'kernel': 'rbf', 'gamma': 0.00013, 'class_wei... 100 ... 0.902715 0.935593 0.875566 0.931638 0.875566 0.928249 0.002590 0.001693 0.015527 0.003002
7 0.352193 0.075244 0.890597 0.937273 0.85 {0: 1.33, 1: 1} 0.00015 rbf {'kernel': 'rbf', 'gamma': 0.00015, 'class_wei... 89 ... 0.898190 0.940113 0.877828 0.938418 0.877828 0.935593 0.005976 0.000707 0.014850 0.002770
8 0.351918 0.076226 0.890145 0.943715 0.85 {0: 1.33, 1: 1} 0.00017 rbf {'kernel': 'rbf', 'gamma': 0.00017, 'class_wei... 100 ... 0.895928 0.946893 0.880090 0.946893 0.875566 0.939548 0.003888 0.001631 0.013724 0.003263
9 0.357914 0.077536 0.892857 0.949140 0.85 {0: 1.33, 1: 1} 0.00019 rbf {'kernel': 'rbf', 'gamma': 0.00019, 'class_wei... 37 ... 0.895928 0.951977 0.884615 0.953672 0.880090 0.944068 0.005477 0.001646 0.011818 0.003673
10 0.363275 0.079470 0.894665 0.953774 0.85 {0: 1.33, 1: 1} 0.00021 rbf {'kernel': 'rbf', 'gamma': 0.00021, 'class_wei... 2 ... 0.893665 0.955932 0.893665 0.957062 0.884615 0.948588 0.006450 0.002137 0.009169 0.003426
11 0.485079 0.106174 0.876582 0.886303 0.86 {0: 1.33, 1: 1} 1e-05 rbf {'kernel': 'rbf', 'gamma': 1e-05, 'class_weigh... 183 ... 0.880090 0.881356 0.873303 0.885311 0.857466 0.891525 0.007442 0.000637 0.011687 0.003473
12 0.386279 0.085589 0.884268 0.900543 0.86 {0: 1.33, 1: 1} 3e-05 rbf {'kernel': 'rbf', 'gamma': 3e-05, 'class_weigh... 163 ... 0.904977 0.897740 0.873303 0.900565 0.871041 0.903955 0.002964 0.004466 0.013607 0.003600
13 0.412019 0.091459 0.886528 0.907663 0.86 {0: 1.33, 1: 1} 5e-05 rbf {'kernel': 'rbf', 'gamma': 5e-05, 'class_weigh... 151 ... 0.907240 0.907345 0.882353 0.909040 0.871041 0.908475 0.063577 0.013460 0.012674 0.001683
14 0.501851 0.108831 0.888788 0.915800 0.86 {0: 1.33, 1: 1} 7e-05 rbf {'kernel': 'rbf', 'gamma': 7e-05, 'class_weigh... 131 ... 0.907240 0.918079 0.882353 0.918079 0.873303 0.916949 0.028417 0.036872 0.012836 0.002749
15 0.535873 0.107396 0.891953 0.920547 0.86 {0: 1.33, 1: 1} 9e-05 rbf {'kernel': 'rbf', 'gamma': 9e-05, 'class_weigh... 55 ... 0.902715 0.924294 0.884615 0.922034 0.877828 0.919774 0.062958 0.021096 0.011165 0.002370
16 0.362678 0.076567 0.892857 0.925519 0.86 {0: 1.33, 1: 1} 0.00011 rbf {'kernel': 'rbf', 'gamma': 0.00011, 'class_wei... 37 ... 0.902715 0.927684 0.880090 0.927684 0.877828 0.922599 0.016281 0.003425 0.013739 0.002233
17 0.349087 0.075353 0.890145 0.931170 0.86 {0: 1.33, 1: 1} 0.00013 rbf {'kernel': 'rbf', 'gamma': 0.00013, 'class_wei... 100 ... 0.902715 0.936158 0.875566 0.932203 0.875566 0.928249 0.002904 0.002170 0.015527 0.003080
18 0.348154 0.074966 0.890597 0.937499 0.86 {0: 1.33, 1: 1} 0.00015 rbf {'kernel': 'rbf', 'gamma': 0.00015, 'class_wei... 89 ... 0.898190 0.940113 0.877828 0.938983 0.877828 0.936158 0.004562 0.000820 0.014850 0.002762
19 0.355929 0.076138 0.890597 0.944167 0.86 {0: 1.33, 1: 1} 0.00017 rbf {'kernel': 'rbf', 'gamma': 0.00017, 'class_wei... 89 ... 0.895928 0.947458 0.880090 0.947458 0.875566 0.940113 0.008124 0.002002 0.013588 0.003219
20 0.362350 0.077477 0.893309 0.949366 0.86 {0: 1.33, 1: 1} 0.00019 rbf {'kernel': 'rbf', 'gamma': 0.00019, 'class_wei... 29 ... 0.895928 0.952542 0.886878 0.953672 0.880090 0.944633 0.016530 0.002222 0.011534 0.003614
21 0.364886 0.078124 0.894213 0.954565 0.86 {0: 1.33, 1: 1} 0.00021 rbf {'kernel': 'rbf', 'gamma': 0.00021, 'class_wei... 6 ... 0.893665 0.957062 0.895928 0.957062 0.884615 0.950282 0.006267 0.001469 0.007560 0.003077
22 0.485646 0.107777 0.876582 0.886303 0.87 {0: 1.33, 1: 1} 1e-05 rbf {'kernel': 'rbf', 'gamma': 1e-05, 'class_weigh... 183 ... 0.880090 0.881356 0.873303 0.884746 0.857466 0.891525 0.013046 0.004606 0.011687 0.003596
23 0.386740 0.082835 0.884268 0.900656 0.87 {0: 1.33, 1: 1} 3e-05 rbf {'kernel': 'rbf', 'gamma': 3e-05, 'class_weigh... 163 ... 0.904977 0.897740 0.875566 0.901130 0.868778 0.903955 0.003106 0.000673 0.013757 0.003608
24 0.362153 0.077962 0.886528 0.908228 0.87 {0: 1.33, 1: 1} 5e-05 rbf {'kernel': 'rbf', 'gamma': 5e-05, 'class_weigh... 151 ... 0.907240 0.908475 0.882353 0.909040 0.871041 0.909605 0.008094 0.001478 0.012674 0.001897
25 0.357825 0.074856 0.888788 0.916252 0.87 {0: 1.33, 1: 1} 7e-05 rbf {'kernel': 'rbf', 'gamma': 7e-05, 'class_weigh... 131 ... 0.907240 0.918079 0.882353 0.919774 0.873303 0.916949 0.014739 0.000388 0.012836 0.002897
26 0.349499 0.075018 0.891501 0.920660 0.87 {0: 1.33, 1: 1} 9e-05 rbf {'kernel': 'rbf', 'gamma': 9e-05, 'class_weigh... 76 ... 0.902715 0.924859 0.884615 0.922034 0.875566 0.919774 0.008123 0.001843 0.011757 0.002553
27 0.349411 0.073545 0.892857 0.926084 0.87 {0: 1.33, 1: 1} 0.00011 rbf {'kernel': 'rbf', 'gamma': 0.00011, 'class_wei... 37 ... 0.902715 0.928814 0.880090 0.927684 0.877828 0.923164 0.009098 0.000617 0.013739 0.002455
28 0.350789 0.074659 0.889693 0.931735 0.87 {0: 1.33, 1: 1} 0.00013 rbf {'kernel': 'rbf', 'gamma': 0.00013, 'class_wei... 110 ... 0.900452 0.936723 0.875566 0.933333 0.875566 0.928814 0.003453 0.001541 0.015183 0.003264
29 0.351878 0.075582 0.890597 0.937951 0.87 {0: 1.33, 1: 1} 0.00015 rbf {'kernel': 'rbf', 'gamma': 0.00015, 'class_wei... 89 ... 0.898190 0.940113 0.877828 0.939548 0.877828 0.936723 0.003958 0.001245 0.014850 0.002662
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
157 0.444571 0.089726 0.888788 0.917608 0.99 {0: 1.33, 1: 1} 7e-05 rbf {'kernel': 'rbf', 'gamma': 7e-05, 'class_weigh... 131 ... 0.902715 0.920339 0.882353 0.919774 0.873303 0.918079 0.083508 0.014550 0.011862 0.002369
158 0.413661 0.078748 0.892405 0.923711 0.99 {0: 1.33, 1: 1} 9e-05 rbf {'kernel': 'rbf', 'gamma': 9e-05, 'class_weigh... 45 ... 0.902715 0.927119 0.882353 0.925424 0.877828 0.920904 0.036201 0.008534 0.011993 0.002496
159 0.360614 0.075561 0.891953 0.929475 0.99 {0: 1.33, 1: 1} 0.00011 rbf {'kernel': 'rbf', 'gamma': 0.00011, 'class_wei... 55 ... 0.902715 0.932768 0.877828 0.929944 0.880090 0.927119 0.011327 0.004267 0.014065 0.002174
160 0.362758 0.079489 0.890597 0.936030 0.99 {0: 1.33, 1: 1} 0.00013 rbf {'kernel': 'rbf', 'gamma': 0.00013, 'class_wei... 89 ... 0.900452 0.938418 0.877828 0.936723 0.875566 0.933898 0.009160 0.011340 0.015524 0.002634
161 0.388988 0.077340 0.891049 0.943941 0.99 {0: 1.33, 1: 1} 0.00015 rbf {'kernel': 'rbf', 'gamma': 0.00015, 'class_wei... 83 ... 0.893665 0.946893 0.875566 0.946328 0.882353 0.940678 0.027769 0.003906 0.013942 0.002747
162 0.375911 0.076609 0.891953 0.949931 0.99 {0: 1.33, 1: 1} 0.00017 rbf {'kernel': 'rbf', 'gamma': 0.00017, 'class_wei... 55 ... 0.893665 0.952542 0.880090 0.953107 0.882353 0.946328 0.013229 0.003802 0.012065 0.003017
163 0.376194 0.076933 0.894213 0.955695 0.99 {0: 1.33, 1: 1} 0.00019 rbf {'kernel': 'rbf', 'gamma': 0.00019, 'class_wei... 6 ... 0.895928 0.958757 0.889140 0.958192 0.884615 0.951412 0.008439 0.002787 0.009481 0.003299
164 0.419080 0.085645 0.894213 0.959312 0.99 {0: 1.33, 1: 1} 0.00021 rbf {'kernel': 'rbf', 'gamma': 0.00021, 'class_wei... 6 ... 0.893665 0.964972 0.889140 0.961017 0.891403 0.953107 0.025396 0.010956 0.005569 0.004251
165 0.561820 0.114892 0.878391 0.887433 1 {0: 1.33, 1: 1} 1e-05 rbf {'kernel': 'rbf', 'gamma': 1e-05, 'class_weigh... 171 ... 0.884615 0.883616 0.873303 0.886441 0.861991 0.890960 0.035033 0.015328 0.010643 0.003193
166 0.482596 0.100246 0.883816 0.902239 1 {0: 1.33, 1: 1} 3e-05 rbf {'kernel': 'rbf', 'gamma': 3e-05, 'class_weigh... 168 ... 0.904977 0.900000 0.875566 0.901130 0.868778 0.906215 0.016734 0.010950 0.014051 0.003113
167 0.463714 0.096052 0.890145 0.911280 1 {0: 1.33, 1: 1} 5e-05 rbf {'kernel': 'rbf', 'gamma': 5e-05, 'class_weigh... 100 ... 0.914027 0.913559 0.884615 0.911864 0.875566 0.912994 0.022301 0.008897 0.014257 0.003145
168 0.413456 0.095846 0.888788 0.917834 1 {0: 1.33, 1: 1} 7e-05 rbf {'kernel': 'rbf', 'gamma': 7e-05, 'class_weigh... 131 ... 0.902715 0.920904 0.882353 0.919774 0.873303 0.918079 0.025419 0.004453 0.011862 0.002349
169 0.383512 0.073859 0.891953 0.923824 1 {0: 1.33, 1: 1} 9e-05 rbf {'kernel': 'rbf', 'gamma': 9e-05, 'class_weigh... 55 ... 0.902715 0.927119 0.880090 0.925989 0.877828 0.920904 0.038308 0.001840 0.012399 0.002754
170 0.407128 0.084379 0.891953 0.930040 1 {0: 1.33, 1: 1} 0.00011 rbf {'kernel': 'rbf', 'gamma': 0.00011, 'class_wei... 55 ... 0.902715 0.933333 0.877828 0.930508 0.880090 0.928814 0.038685 0.010551 0.014065 0.002089
171 0.415157 0.091723 0.891501 0.936369 1 {0: 1.33, 1: 1} 0.00013 rbf {'kernel': 'rbf', 'gamma': 0.00013, 'class_wei... 76 ... 0.900452 0.938983 0.877828 0.936723 0.877828 0.935028 0.020458 0.007832 0.014888 0.002592
172 0.422001 0.087517 0.891501 0.944394 1 {0: 1.33, 1: 1} 0.00015 rbf {'kernel': 'rbf', 'gamma': 0.00015, 'class_wei... 76 ... 0.893665 0.946893 0.877828 0.946893 0.882353 0.941243 0.041294 0.012433 0.013461 0.002453
173 0.416073 0.089631 0.892857 0.950383 1 {0: 1.33, 1: 1} 0.00017 rbf {'kernel': 'rbf', 'gamma': 0.00017, 'class_wei... 37 ... 0.893665 0.953107 0.882353 0.953107 0.884615 0.946893 0.046913 0.012780 0.011286 0.002904
174 0.415082 0.085842 0.893761 0.956261 1 {0: 1.33, 1: 1} 0.00019 rbf {'kernel': 'rbf', 'gamma': 0.00019, 'class_wei... 19 ... 0.895928 0.959887 0.889140 0.958192 0.884615 0.951977 0.037486 0.011709 0.008651 0.003138
175 0.404942 0.082702 0.894665 0.960103 1 {0: 1.33, 1: 1} 0.00021 rbf {'kernel': 'rbf', 'gamma': 0.00021, 'class_wei... 2 ... 0.895928 0.966102 0.889140 0.962147 0.891403 0.953672 0.022592 0.004818 0.005598 0.004607
176 0.486412 0.103226 0.878391 0.887885 1.01 {0: 1.33, 1: 1} 1e-05 rbf {'kernel': 'rbf', 'gamma': 1e-05, 'class_weigh... 171 ... 0.884615 0.883616 0.873303 0.887571 0.861991 0.891525 0.010723 0.001293 0.010643 0.003409
177 0.451571 0.094131 0.883816 0.902465 1.01 {0: 1.33, 1: 1} 3e-05 rbf {'kernel': 'rbf', 'gamma': 3e-05, 'class_weigh... 168 ... 0.904977 0.900000 0.875566 0.901130 0.868778 0.906215 0.041644 0.011842 0.014051 0.002849
178 0.427228 0.086585 0.890145 0.911280 1.01 {0: 1.33, 1: 1} 5e-05 rbf {'kernel': 'rbf', 'gamma': 5e-05, 'class_weigh... 100 ... 0.914027 0.913559 0.884615 0.911864 0.875566 0.913559 0.032641 0.008355 0.014257 0.003169
179 0.370701 0.078460 0.889241 0.917947 1.01 {0: 1.33, 1: 1} 7e-05 rbf {'kernel': 'rbf', 'gamma': 7e-05, 'class_weigh... 124 ... 0.902715 0.920904 0.882353 0.919774 0.873303 0.918079 0.024616 0.010087 0.012414 0.002204
180 0.358549 0.072419 0.891953 0.923937 1.01 {0: 1.33, 1: 1} 9e-05 rbf {'kernel': 'rbf', 'gamma': 9e-05, 'class_weigh... 55 ... 0.902715 0.927119 0.880090 0.925989 0.877828 0.920904 0.011928 0.000619 0.012399 0.002613
181 0.364770 0.077377 0.891501 0.930266 1.01 {0: 1.33, 1: 1} 0.00011 rbf {'kernel': 'rbf', 'gamma': 0.00011, 'class_wei... 76 ... 0.900452 0.933333 0.877828 0.931073 0.880090 0.928814 0.021795 0.009686 0.013745 0.002157
182 0.350404 0.073023 0.890597 0.936708 1.01 {0: 1.33, 1: 1} 0.00013 rbf {'kernel': 'rbf', 'gamma': 0.00013, 'class_wei... 89 ... 0.900452 0.938983 0.873303 0.937288 0.877828 0.935593 0.005686 0.000777 0.015800 0.002676
183 0.357588 0.076488 0.891049 0.944959 1.01 {0: 1.33, 1: 1} 0.00015 rbf {'kernel': 'rbf', 'gamma': 0.00015, 'class_wei... 83 ... 0.893665 0.947458 0.875566 0.946893 0.882353 0.942938 0.009768 0.005555 0.013942 0.002239
184 0.360025 0.074841 0.892857 0.950948 1.01 {0: 1.33, 1: 1} 0.00017 rbf {'kernel': 'rbf', 'gamma': 0.00017, 'class_wei... 37 ... 0.893665 0.954237 0.884615 0.954237 0.884615 0.946893 0.014623 0.001618 0.010039 0.003393
185 0.384417 0.076336 0.893309 0.956600 1.01 {0: 1.33, 1: 1} 0.00019 rbf {'kernel': 'rbf', 'gamma': 0.00019, 'class_wei... 29 ... 0.895928 0.959887 0.889140 0.958757 0.884615 0.951977 0.042491 0.002015 0.007837 0.003174
186 0.379126 0.079808 0.894665 0.960668 1.01 {0: 1.33, 1: 1} 0.00021 rbf {'kernel': 'rbf', 'gamma': 0.00021, 'class_wei... 2 ... 0.895928 0.966102 0.889140 0.962712 0.893665 0.954802 0.013541 0.006596 0.005739 0.004306

187 rows × 24 columns

The cell below plots two heat-maps side by side: one showing how the cross-validated validation accuracy changes for different parameter combinations, and one showing the corresponding training accuracy.


In [31]:
#Draw heatmap of the validation accuracy as a function of gamma and C
fig = plt.figure(figsize=(10, 10))
ix=fig.add_subplot(1,2,1)
val_scores = clf.cv_results_['mean_test_score'].reshape(len(original_c_range),len(gamma_range))
val_scores

ax = sns.heatmap(val_scores, linewidths=0.5, square=True, cmap='PuBuGn', 
                 xticklabels=gamma_range, yticklabels=original_c_range, cbar_kws={'shrink':0.5})
ax.invert_yaxis()
plt.yticks(rotation=0, fontsize=10)
plt.xticks(rotation= 70,fontsize=10)
plt.xlabel('Gamma', fontsize=15)
plt.ylabel('C', fontsize=15)
plt.title('Validation Accuracy', fontsize=15)

#Draw heatmap of the training accuracy as a function of gamma and C
ix=fig.add_subplot(1,2,2)
train_scores = clf.cv_results_['mean_train_score'].reshape(len(original_c_range),len(gamma_range))
train_scores
#plt.figure(figsize=(6, 6))
ax_1 = sns.heatmap(train_scores, linewidths=0.5, square=True, cmap='PuBuGn', 
                 xticklabels=gamma_range, yticklabels=original_c_range, cbar_kws={'shrink':0.5})
ax_1.invert_yaxis()
plt.yticks(rotation=0, fontsize=10)
plt.xticks(rotation= 70,fontsize=10)
plt.xlabel('Gamma', fontsize=15)
plt.ylabel('C', fontsize=15)
plt.title('Training Accuracy', fontsize=15)
plt.show()


The cells below plot a validation curve for gamma.


In [32]:
#import module/library 
from sklearn.model_selection import validation_curve
import matplotlib.pyplot as plt
%matplotlib inline

In [33]:
#specifying gamma parameter range to plot for validation curve 
param_range = gamma_range
param_range


Out[33]:
array([  1.00000000e-05,   3.00000000e-05,   5.00000000e-05,
         7.00000000e-05,   9.00000000e-05,   1.10000000e-04,
         1.30000000e-04,   1.50000000e-04,   1.70000000e-04,
         1.90000000e-04,   2.10000000e-04])

In [34]:
#calculating train and validation scores 
train_scores, valid_scores = validation_curve(SVC(C=0.92, kernel='rbf', class_weight={0:1.33, 1:1}), std_conf_ft_in_pca, conf_ft_outputs, param_name='gamma',param_range=param_range,scoring='accuracy')
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)

In [35]:
#plotting validation curve  
plt.title('Gamma Validation Curve for SVM With RBF Kernel | C=0.92')
plt.xlabel('Gamma')
plt.ylabel('Score')
plt.xticks(rotation=70)
plt.ylim(0.8,1.0)
plt.xlim(0.0001,0.00021)
plt.xticks(param_range)
lw=2
plt.plot(param_range, train_scores_mean, 'o-',label="Training Score", color='darkorange', lw=lw)
plt.fill_between(param_range, train_scores_mean-train_scores_std, train_scores_mean+train_scores_std, alpha=0.2, color='darkorange', lw=lw)
plt.plot(param_range, valid_scores_mean, 'o-',label="Testing Score", color='navy', lw=lw)
plt.fill_between(param_range, valid_scores_mean-valid_scores_std, valid_scores_mean+valid_scores_std, alpha=0.2, color='navy', lw=lw)
plt.legend(loc='best')
plt.show()


The cells below will plot the Learning Curve.


In [36]:
#import module/library 
from sklearn.model_selection import learning_curve

In [37]:
#define training data size increments 
td_size = np.arange(0.1, 1.1, 0.1)
#calculating train and validation scores
train_sizes, train_scores, valid_scores = learning_curve(SVC(C=0.92, kernel='rbf', gamma=0.00011, class_weight={0:1.33, 1:1}), std_conf_ft_in_pca, conf_ft_outputs, train_sizes=td_size ,scoring='accuracy')
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)

In [38]:
#plotting learning curve 
fig = plt.figure(figsize=(5,5))
plt.title('Learning Curve with SVM with RBF Kernel| C=0.92 & Gamma = 0.00011', fontsize=9)
plt.xlabel('Train Data Size')
plt.ylabel('Score')
plt.ylim(0.8,1)
lw=2
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training Score")
plt.fill_between(train_sizes, train_scores_mean-train_scores_std, train_scores_mean+train_scores_std, alpha=0.2, color='red', lw=lw)
plt.plot(train_sizes, valid_scores_mean, 'o-', color="g",label="Testing Score")
plt.fill_between(train_sizes, valid_scores_mean-valid_scores_std, valid_scores_mean+valid_scores_std, alpha=0.2, color='green', lw=lw)
plt.legend(loc='best')
plt.show()


Finding Best Number of Principal Components

The cells below show the optimisation of the number of principal components to include. This is done by iterating over a range of principal-component counts, performing PCA for each count and calculating the mean test score over 3-fold cross-validation. The whole procedure is repeated 5 times to average out the randomness of the PCA solver, and the average test accuracy over the 5 runs is then plotted against the number of principal components included.


In [39]:
#this cell may take several minutes to run 
#plot how the number of PC's changes the test accuracy
no_pcs = np.arange(20, 310, 10)
compute_average_of_5 = []
for t in range(0,5):
    pcs_accuracy_change = []
    for i in no_pcs:
        dummy_inputs = std_conf_ft_in
        dummy_outputs = conf_ft_outputs
        pca_dummy = PCA(n_components=i,)
        pca_dummy.fit(dummy_inputs)
        dummy_inputs_pca = pca_dummy.transform(dummy_inputs)
        dummy_model = SVC(C=0.92, kernel='rbf', gamma=0.00011, class_weight={0:1.33, 1:1})
        dummy_model.fit(dummy_inputs_pca, dummy_outputs,)
        dummy_scores = cross_val_score(dummy_model, dummy_inputs_pca, dummy_outputs, cv=3, scoring='accuracy')
        mean_cv = dummy_scores.mean()
        pcs_accuracy_change.append(mean_cv) 
    print (len(pcs_accuracy_change))
    compute_average_of_5.append(pcs_accuracy_change)


29
29
29
29
29

In [40]:
#calculate position specific average for the five trials 
from __future__ import division
average_acc_4_pcs = [sum(e)/len(e) for e in zip(*compute_average_of_5)]

In [41]:
plt.title('Number of PCs and Change In Accuracy')
plt.xlabel('Number of PCs')
plt.ylabel('Accuracy (%)')
plt.plot(no_pcs, average_acc_4_pcs, 'o-', color="r")
plt.show()


Making Predictions

The following cells will prepare the test data by getting it into the right format.


In [43]:
#Load the test data set
test_data = pd.read_csv("/Users/Max/Desktop/Max's Folder/Uni Work/Data Science MSc/Machine Learning/ML Kaggle Competition /Data Sets/Testing Data Set.csv", header=0, index_col=0)

In [44]:
#observe the test data
test_data


Out[44]:
CNNs CNNs.1 CNNs.2 CNNs.3 CNNs.4 CNNs.5 CNNs.6 CNNs.7 CNNs.8 CNNs.9 ... GIST.502 GIST.503 GIST.504 GIST.505 GIST.506 GIST.507 GIST.508 GIST.509 GIST.510 GIST.511
ID
1 0.194830 1.350300 0.213490 0.000000 0.000000 0.000000 0.351890 0.088491 0.000000 0.019326 ... 0.034106 0.033771 0.033252 0.065845 0.032537 0.013666 0.017434 0.019322 0.022847 0.018033
2 0.000000 0.000000 0.000000 0.165510 0.000000 0.000000 0.387750 0.000000 0.000000 0.000000 ... 0.016437 0.016466 0.027004 0.033501 0.022096 0.017171 0.008196 0.009801 0.024652 0.022242
3 0.000000 0.200780 0.000000 0.000000 2.094100 0.000000 0.299910 0.378720 0.075307 0.000000 ... 0.008704 0.009539 0.024631 0.008418 0.004711 0.005842 0.012716 0.026749 0.018274 0.019011
4 0.000000 0.347870 0.000000 0.073645 0.000000 0.000000 0.000000 0.000000 0.000000 1.229500 ... 0.017548 0.014710 0.024185 0.014486 0.009172 0.023696 0.012986 0.030495 0.043319 0.016503
5 0.884630 0.324790 0.000000 0.000000 0.000000 0.000000 0.011088 0.000000 0.000000 0.482680 ... 0.022969 0.044128 0.002334 0.019548 0.021441 0.021289 0.044368 0.043576 0.036581 0.029603
6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.166160 1.004200 0.000000 0.189100 ... 0.043562 0.021703 0.026738 0.031587 0.034307 0.030655 0.038916 0.034066 0.045780 0.047206
7 0.000000 0.000000 0.000000 0.000000 0.325900 0.000000 0.687090 0.618480 0.447820 0.015397 ... 0.004481 0.001124 0.000679 0.000611 0.004346 0.001564 0.003356 0.003834 0.011894 0.003052
8 0.000000 0.364670 0.000000 0.101220 0.724070 0.102940 0.000000 0.000000 0.000000 0.408030 ... 0.019751 0.026489 0.022905 0.019998 0.023690 0.042806 0.011671 0.010686 0.019967 0.039470
9 0.000000 0.000000 0.782400 0.000000 0.000000 0.000000 0.647210 0.000000 0.000000 0.016120 ... 0.040826 0.025450 0.030382 0.029829 0.039562 0.007886 0.032521 0.016721 0.038936 0.028714
10 0.000000 0.150050 0.000000 0.000000 0.000000 0.307400 0.000000 0.000000 0.000000 0.000000 ... 0.031852 0.047037 0.012018 0.006477 0.028726 0.022898 0.014857 0.006256 0.007239 0.023779
11 0.000000 0.000000 0.000000 0.000000 0.168120 0.000000 0.557410 0.357510 0.000000 0.000000 ... 0.048258 0.032145 0.035011 0.068682 0.039521 0.036103 0.049371 0.056364 0.047784 0.029587
12 0.000000 0.000000 0.000000 1.113900 0.000000 0.000000 0.654960 0.368960 0.116540 0.672640 ... 0.014656 0.022645 0.009311 0.018720 0.014527 0.034215 0.004362 0.020660 0.013758 0.016608
13 0.152170 0.000000 0.054608 0.000000 0.000000 0.000000 0.022096 0.000000 0.853110 1.165300 ... 0.062972 0.040143 0.016354 0.045206 0.044716 0.023516 0.009517 0.023491 0.051504 0.034649
14 0.000000 0.000000 0.000000 0.437360 0.085099 1.019300 0.202030 0.202820 0.911960 0.000000 ... 0.029723 0.025426 0.002588 0.009861 0.024655 0.016447 0.003662 0.010975 0.035349 0.011683
15 0.213150 0.706040 0.000000 0.000000 0.000000 0.000000 0.861110 0.387440 0.000000 0.000000 ... 0.036866 0.053719 0.021971 0.025420 0.035830 0.042621 0.049735 0.059815 0.062077 0.078901
16 0.367090 0.198290 0.399630 1.044100 1.062400 0.739550 0.555000 0.728430 0.549300 0.395800 ... 0.017267 0.005520 0.005491 0.008309 0.012251 0.007130 0.009657 0.017222 0.007176 0.010277
17 1.081300 0.709940 0.000000 0.000000 0.000000 0.038946 0.000000 0.000000 0.621500 0.586920 ... 0.017681 0.015070 0.002573 0.014901 0.016915 0.015710 0.003059 0.006623 0.047580 0.022572
18 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.433450 0.000000 0.000000 ... 0.039709 0.004613 0.002927 0.030539 0.032032 0.004746 0.006050 0.016470 0.024361 0.006372
19 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.628250 0.082205 0.000000 0.000000 ... 0.010681 0.005907 0.006614 0.013034 0.016809 0.006511 0.013259 0.029419 0.021673 0.005455
20 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.316730 0.530300 0.000000 0.000000 ... 0.006544 0.008341 0.001190 0.020739 0.004576 0.007154 0.003611 0.027442 0.006166 0.006204
21 0.177610 1.023500 0.000000 0.253690 0.782780 0.000000 0.000000 0.000000 0.000000 0.929450 ... 0.027354 0.017965 0.035577 0.011575 0.017646 0.035258 0.031280 0.034345 0.009078 0.014279
22 0.000000 0.490950 0.000000 0.000000 0.000000 0.000000 0.000000 1.196700 0.000000 1.791300 ... 0.014434 0.006604 0.058954 0.019730 0.018328 0.013492 0.063643 0.020623 0.010762 0.013761
23 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.380200 0.000000 ... 0.012042 0.017029 0.032655 0.031286 0.020286 0.022929 0.038708 0.040450 0.014474 0.034527
24 0.000000 0.000000 0.000000 0.397890 0.000000 0.396670 0.000000 0.933390 0.000000 0.000000 ... 0.010552 0.003758 0.013717 0.034272 0.024939 0.008811 0.007765 0.008986 0.002322 0.001472
25 0.609620 0.000000 1.208400 0.000000 0.441790 0.000000 0.852390 0.703310 0.046152 0.437660 ... 0.041628 0.021789 0.009921 0.028715 0.037731 0.013175 0.009148 0.018026 0.038273 0.032657
26 0.039763 0.350370 0.030108 0.031349 0.701510 0.000000 1.393400 0.616590 0.820280 0.140160 ... 0.004995 0.016679 0.006073 0.003541 0.013030 0.019313 0.001189 0.000711 0.010844 0.018480
27 0.000000 0.000000 0.000000 0.727110 0.000000 0.400440 0.000000 0.694860 0.000000 0.513950 ... 0.009220 0.003783 0.006770 0.008080 0.009801 0.005275 0.009081 0.012781 0.010865 0.007519
28 0.000000 0.000000 0.000000 0.428450 0.270870 0.772110 0.000000 0.342840 0.000000 0.576240 ... 0.042825 0.034460 0.049683 0.069336 0.041269 0.041437 0.049926 0.059883 0.056016 0.049831
29 0.071236 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.033911 0.017138 0.044084 0.026645 0.033558 0.051734 0.029514 0.031265 0.039061 0.039235
30 0.000000 1.060000 0.000000 0.000000 0.000000 0.000000 0.049577 1.533800 0.000000 0.000000 ... 0.027339 0.032716 0.021481 0.031347 0.029081 0.033590 0.028272 0.038660 0.032977 0.015283
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4171 0.000000 2.734600 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.411400 0.000000 ... 0.042138 0.011998 0.026259 0.041903 0.060205 0.031540 0.052135 0.037284 0.061335 0.047338
4172 0.786940 0.651490 1.183200 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.070863 0.040415 0.053479 0.059165 0.061798 0.071912 0.040353 0.070295 0.030428 0.035420
4173 0.000000 1.991300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.073568 0.017017 0.043290 0.044839 0.057458 0.039226 0.048257 0.044225 0.056488 0.037646
4174 0.000000 0.000000 0.000000 0.000000 0.000000 1.326700 0.000000 0.000000 0.000000 0.106750 ... 0.017138 0.013441 0.024706 0.025719 0.038360 0.017907 0.032489 0.024224 0.022035 0.028300
4175 1.095200 1.567200 0.000000 0.000000 0.000000 0.000000 0.070119 0.000000 0.000000 0.000000 ... 0.042875 0.044804 0.037184 0.068091 0.103290 0.039282 0.023520 0.043291 0.055544 0.043498
4176 0.000000 0.302120 0.000000 0.384000 0.000000 0.000000 0.000000 0.397090 0.000000 0.319180 ... 0.048888 0.039039 0.039669 0.037022 0.037721 0.031369 0.035511 0.038268 0.042715 0.013048
4177 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.366940 1.154200 0.000000 0.000000 ... 0.032254 0.005745 0.026515 0.029190 0.027137 0.007291 0.042835 0.043519 0.026551 0.011185
4178 0.000000 0.061280 0.000000 0.000000 0.000000 0.000000 0.000000 0.138560 0.000000 1.694200 ... 0.016446 0.012421 0.038739 0.032971 0.025489 0.053939 0.043233 0.061656 0.038132 0.027813
4179 0.000000 0.569260 0.000000 0.000000 0.000000 0.000000 0.000000 0.739310 0.749620 0.000000 ... 0.037481 0.008475 0.022053 0.027696 0.024141 0.010281 0.021287 0.040340 0.041724 0.008595
4180 0.000000 0.000000 0.000000 0.000000 0.494700 0.000000 0.000000 0.777980 0.000000 0.000000 ... 0.016469 0.015347 0.031718 0.027136 0.027073 0.017448 0.023853 0.015471 0.019030 0.019459
4181 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.317400 0.000000 ... 0.044988 0.021043 0.041841 0.014692 0.021267 0.032977 0.033808 0.018784 0.018042 0.019031
4182 0.000000 0.410600 1.555400 1.061000 0.165900 0.000000 0.455830 0.000000 0.000000 0.000000 ... 0.031500 0.026408 0.046603 0.034717 0.035535 0.031554 0.029539 0.044249 0.023049 0.045432
4183 1.638500 0.000000 0.000000 0.000000 0.178880 0.000000 0.000000 0.000000 0.000000 0.710170 ... 0.026997 0.003131 0.022698 0.032954 0.027425 0.004165 0.040303 0.038697 0.020735 0.020643
4184 0.000000 0.000000 0.000000 1.645800 0.000000 0.000000 0.000000 0.000000 0.000000 0.003552 ... 0.032534 0.042982 0.042220 0.034992 0.028178 0.022535 0.050429 0.020034 0.023449 0.029159
4185 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.577500 0.000000 0.000000 ... 0.065182 0.039138 0.031415 0.034866 0.031182 0.052862 0.022938 0.036160 0.036301 0.042277
4186 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.182140 0.000000 0.000000 0.000000 ... 0.042382 0.031378 0.049776 0.063646 0.021612 0.015708 0.042415 0.053048 0.044192 0.037856
4187 0.000000 0.000000 0.000000 0.000000 0.023010 0.000000 0.000000 0.220640 1.228100 0.000000 ... 0.034456 0.002692 0.020770 0.053716 0.040541 0.002834 0.020102 0.034843 0.024613 0.005409
4188 0.000000 0.000000 0.000000 0.000000 0.216160 0.000000 0.048958 1.981900 0.000000 0.082497 ... 0.077684 0.059553 0.060726 0.059117 0.045136 0.037635 0.041499 0.039308 0.022825 0.032905
4189 0.000000 0.882690 0.429830 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.188320 ... 0.025227 0.033569 0.034830 0.031152 0.037935 0.059416 0.069511 0.062773 0.041863 0.035643
4190 0.000000 0.728680 0.000000 0.093624 0.687730 0.037979 0.000000 0.000000 0.000000 1.126700 ... 0.020896 0.015064 0.016455 0.026685 0.017372 0.010616 0.012713 0.023552 0.022268 0.016904
4191 0.000000 0.481380 0.000000 0.000000 0.000000 0.773610 0.000000 0.000000 0.000000 0.000000 ... 0.060230 0.038550 0.026339 0.028310 0.027146 0.034679 0.010065 0.010213 0.026589 0.048003
4192 0.000000 0.462310 0.000000 0.167100 0.000000 0.768900 0.716370 0.000000 0.000000 0.169180 ... 0.018414 0.021935 0.024465 0.026636 0.030005 0.028487 0.017633 0.019037 0.040453 0.021242
4193 0.000000 1.089200 0.000000 0.000000 0.000000 0.626980 1.125300 0.000000 0.000000 0.000000 ... 0.014440 0.005520 0.014940 0.030454 0.047470 0.016407 0.004174 0.037404 0.040862 0.014487
4194 0.139580 0.447620 0.192890 0.409270 0.000000 0.000000 0.000000 0.661030 0.000000 0.341400 ... 0.039888 0.041431 0.040149 0.021435 0.043586 0.081216 0.030188 0.045357 0.047763 0.048098
4195 0.000000 0.003217 0.000000 0.000000 0.000000 0.000000 0.000000 0.136060 0.197230 0.000000 ... 0.013453 0.002702 0.015512 0.036910 0.027067 0.002744 0.013099 0.050646 0.023090 0.005154
4196 0.527870 0.297660 0.331530 0.000000 0.000000 0.000000 0.574640 0.258230 0.481020 0.000000 ... 0.029119 0.027585 0.019465 0.020351 0.033806 0.032674 0.015499 0.039595 0.058310 0.059833
4197 0.000000 0.000000 1.155300 0.207210 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.029210 0.061133 0.033955 0.057905 0.045448 0.057810 0.039487 0.043194 0.062088 0.031589
4198 0.000000 0.117070 0.000000 0.000000 0.000000 0.000000 0.000000 1.578400 0.881420 1.545200 ... 0.007609 0.005828 0.004176 0.015774 0.024296 0.005301 0.005606 0.029187 0.035203 0.014258
4199 0.354620 0.000000 0.037829 0.000000 0.647960 0.000000 0.856870 0.687660 0.000000 1.021000 ... 0.023272 0.010051 0.016886 0.041528 0.010361 0.006299 0.017601 0.026010 0.015067 0.029002
4200 0.000000 1.894900 0.000000 0.000000 0.000000 0.719250 0.000000 0.000000 0.000000 0.000000 ... 0.034280 0.025630 0.023124 0.022218 0.034257 0.027829 0.020365 0.021096 0.035821 0.017233

4200 rows × 4608 columns


In [45]:
#turn the test dataframe into a NumPy array
#(DataFrame.as_matrix was removed in later pandas versions; .values is the equivalent call)
test_data_matrix = test_data.values
test_data_matrix.shape


Out[45]:
(4200, 4608)

The following cell applies the same pre-processing that was fitted on the training data (standardisation followed by PCA) to the test data, reusing the already-fitted scaler_2 and pca_2 objects.


In [46]:
#pre-process test data in same way as train data  
scaled_test = scaler_2.transform(test_data_matrix)
transformed_test = pca_2.transform(scaled_test)
transformed_test.shape


Out[46]:
(4200, 230)
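
For context, the scaler and PCA objects reused above were fitted on the training data earlier in the notebook. The sketch below illustrates the fit-on-train / transform-on-test pattern only; the variable name train_data_matrix and the exact fitting cells are assumptions for illustration, not the original code (the 230-component setting matches the shape shown above).

#illustrative sketch of the fit-on-train / transform-on-test pattern (assumed variable names)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler_2 = StandardScaler().fit(train_data_matrix)   #fit the scaler on training data only
std_train = scaler_2.transform(train_data_matrix)

pca_2 = PCA(n_components=230).fit(std_train)         #fit PCA on the scaled training data
std_train_pca = pca_2.transform(std_train)

#the same fitted objects are then applied, unchanged, to the test data
scaled_test = scaler_2.transform(test_data_matrix)
transformed_test = pca_2.transform(scaled_test)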

The following cells fit the final model with the parameters selected by the grid search and produce predictions on the test data.


In [47]:
#define and fit final model with best parameters from grid search
final_model = SVC(C=0.92, cache_size=1000, kernel='rbf', gamma=0.00011, class_weight={0:1.33, 1:1})
final_model.fit(std_conf_ft_in_pca, conf_ft_outputs)


Out[47]:
SVC(C=0.92, cache_size=1000, class_weight={0: 1.33, 1: 1}, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.00011, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
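
The C, gamma and class_weight values above were selected by the grid search described in the report; the tuning cells themselves were omitted from this notebook. A minimal sketch of how such a search could be set up with scikit-learn is given below, assuming the same training matrices as above; the candidate grids shown are illustrative only, not the original search space.

#illustrative grid search sketch (candidate values are examples, not the original grid)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.5, 0.92, 1.5],
              'gamma': [0.00005, 0.00011, 0.0002]}
search = GridSearchCV(SVC(kernel='rbf', cache_size=1000, class_weight={0: 1.33, 1: 1}),
                      param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(std_conf_ft_in_pca, conf_ft_outputs)
print(search.best_params_, search.best_score_)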

In [48]:
#make test data predictions
predictions = final_model.predict(transformed_test)

#match each prediction with its instance ID and convert to a dataframe
to_export = {'ID': np.arange(1, 4201, 1), 'prediction': predictions}
final_predictions = pd.DataFrame.from_dict(to_export)
final_predictions


Out[48]:
        ID  prediction
0        1         1.0
1        2         0.0
2        3         0.0
...    ...         ...
4198  4199         0.0
4199  4200         1.0

4200 rows × 2 columns


In [49]:
#convert the float-valued predictions to integers (the ID column is already integer-valued)
final_predictions = final_predictions.astype('int')
final_predictions


Out[49]:
        ID  prediction
0        1           1
1        2           0
2        3           0
...    ...         ...
4198  4199           0
4199  4200           1

4200 rows × 2 columns


In [50]:
#check the predicted class balance against the given test proportions:
#57.14% class 0 and 42.86% class 1, i.e. approximately 2400 class 0 and 1800 class 1 instances
final_predictions.prediction.value_counts()


Out[50]:
0    2470
1    1730
Name: prediction, dtype: int64
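
The predicted counts (2470 class 0 vs 1730 class 1) are close to, but not exactly, the given test proportions. A small sketch of this comparison, using the proportions stated in the comment above:

#compare predicted counts with the counts implied by the given test proportions
expected = {0: round(0.5714 * 4200), 1: round(0.4286 * 4200)}     #{0: 2400, 1: 1800}
observed = final_predictions.prediction.value_counts().to_dict()  #{0: 2470, 1: 1730}
for label in (0, 1):
    print(label, 'expected:', expected[label], 'predicted:', observed[label])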

References

[1] Vapnik V. (1979) Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York (English translation, 1982).

[2] Cortes C, Vapnik V. (1995) Support-Vector Networks. Machine Learning. Vol. 20: pages 273-297.

[3] Drucker H, Burges CJC, Kaufman L, Smola AJ, Vapnik V. (1997) Support vector regression machines. Advances in Neural Information Processing Systems. Vol. 9: pages 155-161.

[4] Vapnik VN. (1982) Estimation of Dependences Based on Empirical Data, Addendum 1. Springer-Verlag, New York.

[5] Rosasco L, De Vito E, Caponnetto A, Piana M, Verri A. (2004) Are Loss Functions All the Same? Neural Computation. Vol. 16: pages 1063-1076.

[6] Batuwita R, Palade V. (2012) Class Imbalance learning methods for Support Vector Machines. In: Imbalanced Learning: Foundations, Algorithms and Applications, by He H, Ma Y. John Wiley & Sons: Chapter 6.

[7] Lian H. (2012) On feature selection with principal component analysis for one-class SVM. Pattern Recognition Letters. Vol. 33: pages 1027-1031.

[8] Juszczak P, Tax DMJ, Duin RPW. (2002) Feature scaling in support vector data descriptions. Proceedings of the 8th Annual Conference of the Advanced School for Computing and Imaging: pages 1-8. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.2524&rep=rep1&type=pdf

[9] Greenland S, Finkle WD. (1995) A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol. Vol. 142: pages 1255-1264.

[10] Horton NJ, Kleinman KP. (2007) Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Am Stat. Vol. 61: pages 79-90.

[11] Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics. Vol. 17: pages 520-525.

[12] Andridge RR, Little RJ. (2010) A Review of Hot Deck Imputation for Survey Non-response. Int Stat Review. Vol. 78: pages 40-64.

[13] Rubinsteyn A, Feldman S, O’Donnell T, Beaulieu-Jones B. (2015) fancyimpute 0.2.0. Package found on: https://github.com/hammerlab/fancyimpute.

[14] Beretta L, Santaniello A. (2016) Nearest neighbour imputation algorithms: a critical evaluation. BMC Medical Informatics and Decision Making. Vol. 16: pages 197-208.

[15] Barandela R, Valdovinos RM, Sanchez JS, Ferri FJ. (2004) The Imbalanced Training Sample Problem: Under or Over Sampling? Springer-Verlag, Berlin: pages 806-814.

[16] Lemaître G, Nogueira F, Aridas CK. (2017) Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research. Vol. 18: pages 1-5.

[17] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. (2002) SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research. Vol. 16: pages 321-357.

[18] Strong DM, Lee YW, Wang RY. (1997) Data Quality in Context. Communications of the ACM. Vol. 40: pages 103-110.

[19] Blum AL, Langley P. (1997) Selection of relevant features and examples in Machine Learning. Artificial Intelligence. Vol. 97: pages 245-271.

[20] Hira ZM, Gillies DF. (2015) A Review of Feature Selection and Feature Extraction Methods Applied on Microarray data. Advances in Bioinformatics. Vol. 2015: pages 1-13.